fix(eval): optimize screenshot request path and require normalized type coordinates#658
fix(eval): optimize screenshot request path and require normalized type coordinates#658David Shan (DavidBShan) wants to merge 5 commits intobrowseros-ai:mainfrom
Conversation
|
All contributors have signed the CLA. Thank you! |
|
I have read the CLA Document and I hereby sign the CLA |
Greptile SummaryThis PR strengthens the orchestrator-executor eval loop by adding a bootstrap phase (initial page snapshot + seed context), per-action page-progress signals for no-op detection, screenshot attachment to delegation results, and tightened Confidence Score: 4/5Mostly safe to merge; two functional issues should be addressed before production eval runs. The PNG palette compression (P1) risks degrading visual model accuracy for screenshots that exceed 2 MB — directly counteracting the stated reliability goal. The missing bodyText progress signal (P1) can cause premature blocked results on content-updating actions like Load More clicks. Both are straightforward fixes. All other changes are well-implemented. clado-action-executor.ts (PNG palette compression) and page-progress.ts (missing bodyText signal)
|
| Filename | Overview |
|---|---|
| packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/clado-action-executor.ts | Core executor file; adds screenshot size optimization, per-action progress checks, and required x/y for type actions. PNG palette compression and missing bodyText progress signal warrant attention. |
| packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/page-progress.ts | New file capturing page state signals; bodyText is captured but not compared in pageProgressSignals, risking false no-progress on content-only page updates. |
| packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/bootstrap.ts | New file providing initial page state snapshot and formatted observation string; straightforward and well-implemented. |
| packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/index.ts | Wires bootstrap capture and screenshot attachment into the delegation factory; clean integration with existing patterns. |
| packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/orchestrator-agent.ts | Extensive system prompt improvements and bootstrap prompt seeding; logic changes look correct. |
| packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/types.ts | Adds OrchestratorBootstrap type and screenshotDataUrl to ExecutorResult; clean additions. |
| packages/browseros-agent/apps/eval/src/constants.ts | Adds NO_PROGRESS_MAX_STREAK and DOM_STATE_HASH_MAX_CHARS constants; straightforward additions. |
| packages/browseros-agent/apps/eval/src/capture/screenshot.ts | Adds captureBase64 method for in-memory screenshot retrieval; clean and well-structured. |
| packages/browseros-agent/apps/eval/configs/browseros-oe-clado-weekly.json | Adds maxTurns: 5 to the orchestrator config for the weekly eval run. |
Sequence Diagram
sequenceDiagram
participant OE as OrchestratorExecutorEvaluator
participant B as Bootstrap
participant OA as OrchestratorAgent
participant EF as ExecutorFactory
participant CE as CladoActionExecutor
participant PP as PageProgress
OE->>B: captureInitialPageSnapshot(browser, pageId)
B-->>OE: BootstrapPageSnapshot
OE->>OA: agent.run(OrchestratorBootstrap)
loop Orchestration turns (maxTurns=5)
OA->>EF: delegate(instruction)
EF->>CE: executor.execute(instruction)
loop Per action (max 5)
CE->>CE: captureScreenshotBase64()
CE->>CE: optimizeScreenshotForRequest() if >2MB
CE->>CE: requestActionPrediction(Clado API)
CE->>PP: capturePageProgressSnapshot() before
CE->>CE: executeAction()
CE->>PP: capturePageProgressSnapshot() after
PP-->>CE: pageProgressSignals(before, after)
alt no signals and same action streak >= 3
CE-->>EF: status=blocked
end
end
CE-->>EF: ExecutorResult
EF->>EF: captureBase64() post-delegation screenshot
EF-->>OA: status, observation, screenshotDataUrl
alt task complete
OA-->>OE: final answer text
end
end
OE-->>OE: saveMetadata()
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/clado-action-executor.ts
Line: 113-118
Comment:
**PNG palette mode may degrade inference quality**
`palette: true` quantizes the image to an indexed 8-bit color table (~204 colors at `quality: 80`). Web screenshots contain gradients, shadows, and anti-aliased text that easily require thousands of colors — palette quantization introduces dithering artifacts that can make UI text unreadable for a visual model, working against the stated goal of "optimizing for inference reliability."
Consider using JPEG instead, which provides continuous-tone compression without the color-count ceiling:
```suggestion
const resized = await sharp(original)
.resize({ width, withoutEnlargement: true })
.jpeg({ quality: 80, mozjpeg: true })
.toBuffer()
```
And update the fallback similarly:
```typescript
const smallest = await sharp(original)
.resize({ width: 512, withoutEnlargement: true })
.jpeg({ quality: 70, mozjpeg: true })
.toBuffer()
```
If the downstream Clado endpoint requires PNG, use `.png({ compressionLevel: 9 })` without `palette`.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/page-progress.ts
Line: 158-186
Comment:
**`bodyText` change not included as a progress signal**
`pageProgressSignals` checks URL, scroll, focus, popup state, and form control values — but it never compares `before.bodyText` vs `after.bodyText`. A "Load More" click, an AJAX content refresh, or a page that updates its main text without changing the URL/scroll/form state will produce zero signals. The executor's action will be mis-classified as "no progress" and the streak counter will advance, potentially triggering an early `blocked` result on a perfectly productive action.
```suggestion
export function pageProgressSignals(
before: PageProgressSnapshot | null,
after: PageProgressSnapshot | null,
): string[] {
if (!before || !after) return []
const signals: string[] = []
if (before.url !== after.url || before.title !== after.title) {
signals.push('navigation')
}
if (
Math.abs(after.scrollX - before.scrollX) >= 8 ||
Math.abs(after.scrollY - before.scrollY) >= 8
) {
signals.push('scroll')
}
if (JSON.stringify(before.active) !== JSON.stringify(after.active)) {
signals.push('focus')
}
if (
before.expandedCount !== after.expandedCount ||
before.popupCount !== after.popupCount
) {
signals.push('open_state')
}
if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) {
signals.push('value')
}
if (before.bodyText !== after.bodyText) {
signals.push('content')
}
return signals
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/clado-action-executor.ts
Line: 94-96
Comment:
**Coordinate range: implementation caps at 999, error message says 0-1000**
`clampNormalized` clamps to `[0, 999]`, but the `type` action's thrown error (line 599) tells callers the valid range is "0-1000". A caller passing exactly `1000` will have it silently clamped to `999`, which is a ~0.1% positional shift. The discrepancy should be consistent — either change the clamp to `Math.min(1000, …)` or update the error message to "0-999".
```suggestion
function clampNormalized(value: number): number {
return Math.min(1000, Math.max(0, Math.round(value)))
}
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(eval): avoid invalid multimodal orch..." | Re-trigger Greptile
| .png({ compressionLevel: 9, palette: true, quality: 80 }) | ||
| .toBuffer() | ||
| if (resized.byteLength <= CLADO_IMAGE_MAX_BYTES) { | ||
| return resized.toString('base64') | ||
| } | ||
| } |
There was a problem hiding this comment.
PNG palette mode may degrade inference quality
palette: true quantizes the image to an indexed 8-bit color table (~204 colors at quality: 80). Web screenshots contain gradients, shadows, and anti-aliased text that easily require thousands of colors — palette quantization introduces dithering artifacts that can make UI text unreadable for a visual model, working against the stated goal of "optimizing for inference reliability."
Consider using JPEG instead, which provides continuous-tone compression without the color-count ceiling:
| .png({ compressionLevel: 9, palette: true, quality: 80 }) | |
| .toBuffer() | |
| if (resized.byteLength <= CLADO_IMAGE_MAX_BYTES) { | |
| return resized.toString('base64') | |
| } | |
| } | |
| const resized = await sharp(original) | |
| .resize({ width, withoutEnlargement: true }) | |
| .jpeg({ quality: 80, mozjpeg: true }) | |
| .toBuffer() |
And update the fallback similarly:
const smallest = await sharp(original)
.resize({ width: 512, withoutEnlargement: true })
.jpeg({ quality: 70, mozjpeg: true })
.toBuffer()If the downstream Clado endpoint requires PNG, use .png({ compressionLevel: 9 }) without palette.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/clado-action-executor.ts
Line: 113-118
Comment:
**PNG palette mode may degrade inference quality**
`palette: true` quantizes the image to an indexed 8-bit color table (~204 colors at `quality: 80`). Web screenshots contain gradients, shadows, and anti-aliased text that easily require thousands of colors — palette quantization introduces dithering artifacts that can make UI text unreadable for a visual model, working against the stated goal of "optimizing for inference reliability."
Consider using JPEG instead, which provides continuous-tone compression without the color-count ceiling:
```suggestion
const resized = await sharp(original)
.resize({ width, withoutEnlargement: true })
.jpeg({ quality: 80, mozjpeg: true })
.toBuffer()
```
And update the fallback similarly:
```typescript
const smallest = await sharp(original)
.resize({ width: 512, withoutEnlargement: true })
.jpeg({ quality: 70, mozjpeg: true })
.toBuffer()
```
If the downstream Clado endpoint requires PNG, use `.png({ compressionLevel: 9 })` without `palette`.
How can I resolve this? If you propose a fix, please make it concise.| export function pageProgressSignals( | ||
| before: PageProgressSnapshot | null, | ||
| after: PageProgressSnapshot | null, | ||
| ): string[] { | ||
| if (!before || !after) return [] | ||
|
|
||
| const signals: string[] = [] | ||
| if (before.url !== after.url || before.title !== after.title) { | ||
| signals.push('navigation') | ||
| } | ||
| if ( | ||
| Math.abs(after.scrollX - before.scrollX) >= 8 || | ||
| Math.abs(after.scrollY - before.scrollY) >= 8 | ||
| ) { | ||
| signals.push('scroll') | ||
| } | ||
| if (JSON.stringify(before.active) !== JSON.stringify(after.active)) { | ||
| signals.push('focus') | ||
| } | ||
| if ( | ||
| before.expandedCount !== after.expandedCount || | ||
| before.popupCount !== after.popupCount | ||
| ) { | ||
| signals.push('open_state') | ||
| } | ||
| if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) { | ||
| signals.push('value') | ||
| } | ||
| return signals |
There was a problem hiding this comment.
bodyText change not included as a progress signal
pageProgressSignals checks URL, scroll, focus, popup state, and form control values — but it never compares before.bodyText vs after.bodyText. A "Load More" click, an AJAX content refresh, or a page that updates its main text without changing the URL/scroll/form state will produce zero signals. The executor's action will be mis-classified as "no progress" and the streak counter will advance, potentially triggering an early blocked result on a perfectly productive action.
| export function pageProgressSignals( | |
| before: PageProgressSnapshot | null, | |
| after: PageProgressSnapshot | null, | |
| ): string[] { | |
| if (!before || !after) return [] | |
| const signals: string[] = [] | |
| if (before.url !== after.url || before.title !== after.title) { | |
| signals.push('navigation') | |
| } | |
| if ( | |
| Math.abs(after.scrollX - before.scrollX) >= 8 || | |
| Math.abs(after.scrollY - before.scrollY) >= 8 | |
| ) { | |
| signals.push('scroll') | |
| } | |
| if (JSON.stringify(before.active) !== JSON.stringify(after.active)) { | |
| signals.push('focus') | |
| } | |
| if ( | |
| before.expandedCount !== after.expandedCount || | |
| before.popupCount !== after.popupCount | |
| ) { | |
| signals.push('open_state') | |
| } | |
| if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) { | |
| signals.push('value') | |
| } | |
| return signals | |
| export function pageProgressSignals( | |
| before: PageProgressSnapshot | null, | |
| after: PageProgressSnapshot | null, | |
| ): string[] { | |
| if (!before || !after) return [] | |
| const signals: string[] = [] | |
| if (before.url !== after.url || before.title !== after.title) { | |
| signals.push('navigation') | |
| } | |
| if ( | |
| Math.abs(after.scrollX - before.scrollX) >= 8 || | |
| Math.abs(after.scrollY - before.scrollY) >= 8 | |
| ) { | |
| signals.push('scroll') | |
| } | |
| if (JSON.stringify(before.active) !== JSON.stringify(after.active)) { | |
| signals.push('focus') | |
| } | |
| if ( | |
| before.expandedCount !== after.expandedCount || | |
| before.popupCount !== after.popupCount | |
| ) { | |
| signals.push('open_state') | |
| } | |
| if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) { | |
| signals.push('value') | |
| } | |
| if (before.bodyText !== after.bodyText) { | |
| signals.push('content') | |
| } | |
| return signals | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/page-progress.ts
Line: 158-186
Comment:
**`bodyText` change not included as a progress signal**
`pageProgressSignals` checks URL, scroll, focus, popup state, and form control values — but it never compares `before.bodyText` vs `after.bodyText`. A "Load More" click, an AJAX content refresh, or a page that updates its main text without changing the URL/scroll/form state will produce zero signals. The executor's action will be mis-classified as "no progress" and the streak counter will advance, potentially triggering an early `blocked` result on a perfectly productive action.
```suggestion
export function pageProgressSignals(
before: PageProgressSnapshot | null,
after: PageProgressSnapshot | null,
): string[] {
if (!before || !after) return []
const signals: string[] = []
if (before.url !== after.url || before.title !== after.title) {
signals.push('navigation')
}
if (
Math.abs(after.scrollX - before.scrollX) >= 8 ||
Math.abs(after.scrollY - before.scrollY) >= 8
) {
signals.push('scroll')
}
if (JSON.stringify(before.active) !== JSON.stringify(after.active)) {
signals.push('focus')
}
if (
before.expandedCount !== after.expandedCount ||
before.popupCount !== after.popupCount
) {
signals.push('open_state')
}
if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) {
signals.push('value')
}
if (before.bodyText !== after.bodyText) {
signals.push('content')
}
return signals
}
```
How can I resolve this? If you propose a fix, please make it concise.25eb9c3 to
823b515
Compare
3461afc to
a0a3782
Compare
Summary
This PR strengthens the eval orchestrator-executor flow for Clado runs by improving task bootstrap context, progress detection, screenshot handling, and delegation guardrails.
What Changed
x/ycoordinates fortypeactions (0-1000).Why