Skip to content

fix(eval): optimize screenshot request path and require normalized type coordinates#658

Open
David Shan (DavidBShan) wants to merge 5 commits intobrowseros-ai:mainfrom
DavidBShan:main
Open

fix(eval): optimize screenshot request path and require normalized type coordinates#658
David Shan (DavidBShan) wants to merge 5 commits intobrowseros-ai:mainfrom
DavidBShan:main

Conversation

@DavidBShan
Copy link
Copy Markdown

@DavidBShan David Shan (DavidBShan) commented Apr 8, 2026

Summary

This PR strengthens the eval orchestrator-executor flow for Clado runs by improving task bootstrap context, progress detection, screenshot handling, and delegation guardrails.

What Changed

  • Added an orchestrator bootstrap path that captures initial page state and seeds the orchestration loop with stronger context.
  • Added page progress snapshot/signals logic to detect no-op or repeated executor behavior and prevent unproductive action loops.
  • Improved delegation observability by attaching screenshots/evidence to delegation steps and tightening status reporting.
  • Updated the Clado action executor to:
    • optimize screenshot request payloads for inference reliability,
    • require normalized x/y coordinates for type actions (0-1000).
  • Updated orchestrator-executor types/config/constants and weekly eval config defaults to align with the new runtime behavior.

Why

  • Reduce stalled delegations and repeated no-progress actions.
  • Improve executor predictability for text-entry actions.
  • Increase signal quality for debugging and grading by preserving better screenshot evidence and initial context.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 8, 2026

All contributors have signed the CLA. Thank you!
Posted by the CLA Assistant Lite bot.

@DavidBShan
Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

@felarof99
Copy link
Copy Markdown
Contributor

Greptile (@greptileai)

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 8, 2026

Greptile Summary

This PR strengthens the orchestrator-executor eval loop by adding a bootstrap phase (initial page snapshot + seed context), per-action page-progress signals for no-op detection, screenshot attachment to delegation results, and tightened type action validation requiring normalized x/y coordinates. It also introduces screenshot size optimization for Clado inference requests.

Confidence Score: 4/5

Mostly safe to merge; two functional issues should be addressed before production eval runs.

The PNG palette compression (P1) risks degrading visual model accuracy for screenshots that exceed 2 MB — directly counteracting the stated reliability goal. The missing bodyText progress signal (P1) can cause premature blocked results on content-updating actions like Load More clicks. Both are straightforward fixes. All other changes are well-implemented.

clado-action-executor.ts (PNG palette compression) and page-progress.ts (missing bodyText signal)

Vulnerabilities

No security concerns identified. The Clado API key is passed as a Bearer token in-process from resolved config; no secrets are logged. The evaluate_script injection point uses the browser's own CDP sandbox. The parseActionsFromRawResponse JSON parser is wrapped in a try/catch, preventing malformed model output from crashing the executor.

Important Files Changed

Filename Overview
packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/clado-action-executor.ts Core executor file; adds screenshot size optimization, per-action progress checks, and required x/y for type actions. PNG palette compression and missing bodyText progress signal warrant attention.
packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/page-progress.ts New file capturing page state signals; bodyText is captured but not compared in pageProgressSignals, risking false no-progress on content-only page updates.
packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/bootstrap.ts New file providing initial page state snapshot and formatted observation string; straightforward and well-implemented.
packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/index.ts Wires bootstrap capture and screenshot attachment into the delegation factory; clean integration with existing patterns.
packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/orchestrator-agent.ts Extensive system prompt improvements and bootstrap prompt seeding; logic changes look correct.
packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/types.ts Adds OrchestratorBootstrap type and screenshotDataUrl to ExecutorResult; clean additions.
packages/browseros-agent/apps/eval/src/constants.ts Adds NO_PROGRESS_MAX_STREAK and DOM_STATE_HASH_MAX_CHARS constants; straightforward additions.
packages/browseros-agent/apps/eval/src/capture/screenshot.ts Adds captureBase64 method for in-memory screenshot retrieval; clean and well-structured.
packages/browseros-agent/apps/eval/configs/browseros-oe-clado-weekly.json Adds maxTurns: 5 to the orchestrator config for the weekly eval run.

Sequence Diagram

sequenceDiagram
    participant OE as OrchestratorExecutorEvaluator
    participant B as Bootstrap
    participant OA as OrchestratorAgent
    participant EF as ExecutorFactory
    participant CE as CladoActionExecutor
    participant PP as PageProgress

    OE->>B: captureInitialPageSnapshot(browser, pageId)
    B-->>OE: BootstrapPageSnapshot
    OE->>OA: agent.run(OrchestratorBootstrap)
    loop Orchestration turns (maxTurns=5)
        OA->>EF: delegate(instruction)
        EF->>CE: executor.execute(instruction)
        loop Per action (max 5)
            CE->>CE: captureScreenshotBase64()
            CE->>CE: optimizeScreenshotForRequest() if >2MB
            CE->>CE: requestActionPrediction(Clado API)
            CE->>PP: capturePageProgressSnapshot() before
            CE->>CE: executeAction()
            CE->>PP: capturePageProgressSnapshot() after
            PP-->>CE: pageProgressSignals(before, after)
            alt no signals and same action streak >= 3
                CE-->>EF: status=blocked
            end
        end
        CE-->>EF: ExecutorResult
        EF->>EF: captureBase64() post-delegation screenshot
        EF-->>OA: status, observation, screenshotDataUrl
        alt task complete
            OA-->>OE: final answer text
        end
    end
    OE-->>OE: saveMetadata()
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/clado-action-executor.ts
Line: 113-118

Comment:
**PNG palette mode may degrade inference quality**

`palette: true` quantizes the image to an indexed 8-bit color table (~204 colors at `quality: 80`). Web screenshots contain gradients, shadows, and anti-aliased text that easily require thousands of colors — palette quantization introduces dithering artifacts that can make UI text unreadable for a visual model, working against the stated goal of "optimizing for inference reliability."

Consider using JPEG instead, which provides continuous-tone compression without the color-count ceiling:

```suggestion
    const resized = await sharp(original)
      .resize({ width, withoutEnlargement: true })
      .jpeg({ quality: 80, mozjpeg: true })
      .toBuffer()
```

And update the fallback similarly:
```typescript
const smallest = await sharp(original)
  .resize({ width: 512, withoutEnlargement: true })
  .jpeg({ quality: 70, mozjpeg: true })
  .toBuffer()
```

If the downstream Clado endpoint requires PNG, use `.png({ compressionLevel: 9 })` without `palette`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/page-progress.ts
Line: 158-186

Comment:
**`bodyText` change not included as a progress signal**

`pageProgressSignals` checks URL, scroll, focus, popup state, and form control values — but it never compares `before.bodyText` vs `after.bodyText`. A "Load More" click, an AJAX content refresh, or a page that updates its main text without changing the URL/scroll/form state will produce zero signals. The executor's action will be mis-classified as "no progress" and the streak counter will advance, potentially triggering an early `blocked` result on a perfectly productive action.

```suggestion
export function pageProgressSignals(
  before: PageProgressSnapshot | null,
  after: PageProgressSnapshot | null,
): string[] {
  if (!before || !after) return []

  const signals: string[] = []
  if (before.url !== after.url || before.title !== after.title) {
    signals.push('navigation')
  }
  if (
    Math.abs(after.scrollX - before.scrollX) >= 8 ||
    Math.abs(after.scrollY - before.scrollY) >= 8
  ) {
    signals.push('scroll')
  }
  if (JSON.stringify(before.active) !== JSON.stringify(after.active)) {
    signals.push('focus')
  }
  if (
    before.expandedCount !== after.expandedCount ||
    before.popupCount !== after.popupCount
  ) {
    signals.push('open_state')
  }
  if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) {
    signals.push('value')
  }
  if (before.bodyText !== after.bodyText) {
    signals.push('content')
  }
  return signals
}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/clado-action-executor.ts
Line: 94-96

Comment:
**Coordinate range: implementation caps at 999, error message says 0-1000**

`clampNormalized` clamps to `[0, 999]`, but the `type` action's thrown error (line 599) tells callers the valid range is "0-1000". A caller passing exactly `1000` will have it silently clamped to `999`, which is a ~0.1% positional shift. The discrepancy should be consistent — either change the clamp to `Math.min(1000, …)` or update the error message to "0-999".

```suggestion
function clampNormalized(value: number): number {
  return Math.min(1000, Math.max(0, Math.round(value)))
}
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(eval): avoid invalid multimodal orch..." | Re-trigger Greptile

Comment on lines +113 to +118
.png({ compressionLevel: 9, palette: true, quality: 80 })
.toBuffer()
if (resized.byteLength <= CLADO_IMAGE_MAX_BYTES) {
return resized.toString('base64')
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 PNG palette mode may degrade inference quality

palette: true quantizes the image to an indexed 8-bit color table (~204 colors at quality: 80). Web screenshots contain gradients, shadows, and anti-aliased text that easily require thousands of colors — palette quantization introduces dithering artifacts that can make UI text unreadable for a visual model, working against the stated goal of "optimizing for inference reliability."

Consider using JPEG instead, which provides continuous-tone compression without the color-count ceiling:

Suggested change
.png({ compressionLevel: 9, palette: true, quality: 80 })
.toBuffer()
if (resized.byteLength <= CLADO_IMAGE_MAX_BYTES) {
return resized.toString('base64')
}
}
const resized = await sharp(original)
.resize({ width, withoutEnlargement: true })
.jpeg({ quality: 80, mozjpeg: true })
.toBuffer()

And update the fallback similarly:

const smallest = await sharp(original)
  .resize({ width: 512, withoutEnlargement: true })
  .jpeg({ quality: 70, mozjpeg: true })
  .toBuffer()

If the downstream Clado endpoint requires PNG, use .png({ compressionLevel: 9 }) without palette.

Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/clado-action-executor.ts
Line: 113-118

Comment:
**PNG palette mode may degrade inference quality**

`palette: true` quantizes the image to an indexed 8-bit color table (~204 colors at `quality: 80`). Web screenshots contain gradients, shadows, and anti-aliased text that easily require thousands of colors — palette quantization introduces dithering artifacts that can make UI text unreadable for a visual model, working against the stated goal of "optimizing for inference reliability."

Consider using JPEG instead, which provides continuous-tone compression without the color-count ceiling:

```suggestion
    const resized = await sharp(original)
      .resize({ width, withoutEnlargement: true })
      .jpeg({ quality: 80, mozjpeg: true })
      .toBuffer()
```

And update the fallback similarly:
```typescript
const smallest = await sharp(original)
  .resize({ width: 512, withoutEnlargement: true })
  .jpeg({ quality: 70, mozjpeg: true })
  .toBuffer()
```

If the downstream Clado endpoint requires PNG, use `.png({ compressionLevel: 9 })` without `palette`.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +158 to +186
export function pageProgressSignals(
before: PageProgressSnapshot | null,
after: PageProgressSnapshot | null,
): string[] {
if (!before || !after) return []

const signals: string[] = []
if (before.url !== after.url || before.title !== after.title) {
signals.push('navigation')
}
if (
Math.abs(after.scrollX - before.scrollX) >= 8 ||
Math.abs(after.scrollY - before.scrollY) >= 8
) {
signals.push('scroll')
}
if (JSON.stringify(before.active) !== JSON.stringify(after.active)) {
signals.push('focus')
}
if (
before.expandedCount !== after.expandedCount ||
before.popupCount !== after.popupCount
) {
signals.push('open_state')
}
if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) {
signals.push('value')
}
return signals
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 bodyText change not included as a progress signal

pageProgressSignals checks URL, scroll, focus, popup state, and form control values — but it never compares before.bodyText vs after.bodyText. A "Load More" click, an AJAX content refresh, or a page that updates its main text without changing the URL/scroll/form state will produce zero signals. The executor's action will be mis-classified as "no progress" and the streak counter will advance, potentially triggering an early blocked result on a perfectly productive action.

Suggested change
export function pageProgressSignals(
before: PageProgressSnapshot | null,
after: PageProgressSnapshot | null,
): string[] {
if (!before || !after) return []
const signals: string[] = []
if (before.url !== after.url || before.title !== after.title) {
signals.push('navigation')
}
if (
Math.abs(after.scrollX - before.scrollX) >= 8 ||
Math.abs(after.scrollY - before.scrollY) >= 8
) {
signals.push('scroll')
}
if (JSON.stringify(before.active) !== JSON.stringify(after.active)) {
signals.push('focus')
}
if (
before.expandedCount !== after.expandedCount ||
before.popupCount !== after.popupCount
) {
signals.push('open_state')
}
if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) {
signals.push('value')
}
return signals
export function pageProgressSignals(
before: PageProgressSnapshot | null,
after: PageProgressSnapshot | null,
): string[] {
if (!before || !after) return []
const signals: string[] = []
if (before.url !== after.url || before.title !== after.title) {
signals.push('navigation')
}
if (
Math.abs(after.scrollX - before.scrollX) >= 8 ||
Math.abs(after.scrollY - before.scrollY) >= 8
) {
signals.push('scroll')
}
if (JSON.stringify(before.active) !== JSON.stringify(after.active)) {
signals.push('focus')
}
if (
before.expandedCount !== after.expandedCount ||
before.popupCount !== after.popupCount
) {
signals.push('open_state')
}
if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) {
signals.push('value')
}
if (before.bodyText !== after.bodyText) {
signals.push('content')
}
return signals
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/agents/orchestrator-executor/page-progress.ts
Line: 158-186

Comment:
**`bodyText` change not included as a progress signal**

`pageProgressSignals` checks URL, scroll, focus, popup state, and form control values — but it never compares `before.bodyText` vs `after.bodyText`. A "Load More" click, an AJAX content refresh, or a page that updates its main text without changing the URL/scroll/form state will produce zero signals. The executor's action will be mis-classified as "no progress" and the streak counter will advance, potentially triggering an early `blocked` result on a perfectly productive action.

```suggestion
export function pageProgressSignals(
  before: PageProgressSnapshot | null,
  after: PageProgressSnapshot | null,
): string[] {
  if (!before || !after) return []

  const signals: string[] = []
  if (before.url !== after.url || before.title !== after.title) {
    signals.push('navigation')
  }
  if (
    Math.abs(after.scrollX - before.scrollX) >= 8 ||
    Math.abs(after.scrollY - before.scrollY) >= 8
  ) {
    signals.push('scroll')
  }
  if (JSON.stringify(before.active) !== JSON.stringify(after.active)) {
    signals.push('focus')
  }
  if (
    before.expandedCount !== after.expandedCount ||
    before.popupCount !== after.popupCount
  ) {
    signals.push('open_state')
  }
  if (JSON.stringify(before.controls) !== JSON.stringify(after.controls)) {
    signals.push('value')
  }
  if (before.bodyText !== after.bodyText) {
    signals.push('content')
  }
  return signals
}
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants