feat: deterministic eval graders (AGI SDK + WebArena-Infinity) by shivammittal274 · Pull Request #664 · browseros-ai/BrowserOS

shivammittal274 · 2026-04-09T06:50:30Z

Summary

- Two new benchmark integrations with programmatic grading (no LLM judge)
- AGI SDK / REAL Bench: 52 tasks across 11 consumer app clones (DoorDash, Amazon, Gmail, Airbnb, etc.). The grader fetches the state diff from the /finish page via MCP; a Python verifier checks exact values via jmespath.
- WebArena-Infinity: 50 hard tasks across 13 SaaS app clones (Gmail, GitLab, Linear, Figma, etc.). A fresh app server is started per task per worker via InfinityAppManager; a Python verifier checks /api/state.
- Both support 10 parallel workers with isolated ports
- CI workflow updated: pip install agisdk, clone webarena-infinity, OPENROUTER_API_KEY secret
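The exact-value grading described for the AGI SDK tasks can be sketched roughly as follows. This is a minimal illustration, not the PR's verifier: the dotted-path `lookup` helper is a stand-in for a jmespath query, and the state shape, criteria, and function names are all hypothetical. The result fields mirror the `{reward, pass, per_criterion}` shape shown in the review's sequence diagram.

```python
from typing import Any


def lookup(state: Any, path: str) -> Any:
    """Follow a dotted path such as 'cart.items.0.name' through dicts and lists.

    Stand-in for a jmespath query; returns None when any step is missing.
    """
    current = state
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            return None
    return current


def grade(state_diff: dict, criteria: dict[str, Any]) -> dict:
    """Compare each expected value exactly; pass only if every criterion matches."""
    per_criterion = {
        path: lookup(state_diff, path) == expected
        for path, expected in criteria.items()
    }
    passed = all(per_criterion.values())
    return {
        "pass": passed,
        "reward": 1.0 if passed else 0.0,
        "per_criterion": per_criterion,
    }


# Hypothetical state diff, shaped like what a /finish page might expose.
example_diff = {"cart": {"items": [{"name": "Pad Thai", "qty": 2}]}}
result = grade(example_diff, {"cart.items.0.name": "Pad Thai", "cart.items.0.qty": 2})
print(result["pass"])
```

Because every criterion is an equality check against recorded state, repeated grading of the same state gives the same verdict, which is the property the PR calls deterministic.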

Initial Results (Kimi K2.5)

| Benchmark | Tasks | Pass Rate |
| --- | --- | --- |
| AGI SDK / REAL | 52 | 30.8% |
| WebArena-Infinity Hard | 50 | 42.0% |

Test plan

- AGI SDK: 3-task local smoke test, 3/3 passed
- AGI SDK: 52-task CI run, 30.8% pass rate, grading deterministic
- Infinity: 3-task local smoke test, 2/3 passed (1 legit agent failure)
- Infinity: 50-task CI run, 42% pass rate, grading deterministic
- Run with Claude Opus 4.6 via OpenRouter

🤖 Generated with Claude Code

Two new benchmark integrations with programmatic grading (no LLM judge).

AGI SDK / REAL Bench (52 tasks):
- 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.)
- Grader navigates browser to /finish, extracts state diff from <pre> tag
- Python verifier checks exact values via jmespath queries

WebArena-Infinity (50 hard tasks):
- 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.)
- InfinityAppManager starts fresh app server per task per worker
- Python verifier calls /api/state and asserts on JSON state

Infrastructure:
- GraderInput extended with mcpUrl + infinityAppUrl for parallel workers
- Each worker gets isolated ports (no cross-worker state contamination)
- CI workflow: pip install agisdk, clone webarena-infinity repo
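A verifier of the kind described for WebArena-Infinity might look roughly like the sketch below. Only the `/api/state` path and the `{pass, reward, message}` result shape come from this PR; the state layout (`issues`, `status`) and the function names are hypothetical, and real verifiers live in the webarena-infinity repo.

```python
import json
import urllib.request


def fetch_state(app_url: str) -> dict:
    """GET <app_url>/api/state and decode the JSON body."""
    # Bypass any system proxy: the app server is assumed to run on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{app_url}/api/state", timeout=5) as resp:
        return json.load(resp)


def verify(app_url: str) -> dict:
    """Hypothetical check: pass if exactly one issue exists and it is closed."""
    state = fetch_state(app_url)
    issues = state.get("issues", [])
    ok = len(issues) == 1 and issues[0].get("status") == "closed"
    message = "ok" if ok else f"unexpected issues: {issues}"
    return {"pass": ok, "reward": 1.0 if ok else 0.0, "message": message}
```

A call such as `verify("http://localhost:8100")` would return the verdict as a dict; the port here is only an example.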

greptile-apps · 2026-04-09T06:54:46Z

Greptile Summary

Adds two deterministic benchmark integrations (AGI SDK / REAL Bench and WebArena-Infinity) with programmatic graders, per-task app-server lifecycle management, and CI wiring. The implementation is well-structured with clear separation between the TypeScript runner and Python verifiers.

P1 – infinity-app-manager.ts: waitForReady uses resp.ok (2xx only); Python servers that return 404 at / will be treated as "not ready," timing out every Infinity task before the agent even starts.

Confidence Score: 4/5

Mostly safe — one P1 in the server-readiness check could silently fail all Infinity tasks; fix is a one-liner before merging.

The P1 waitForReady issue (resp.ok rejects non-2xx) is a genuine runtime defect that would cause InfinityAppManager to time out on every task if the Python server doesn't serve 200 at root. The two P2s are non-blocking quality issues.
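The readiness rule proposed here (any HTTP status means the server is up, only a network error means it is not) can be illustrated with a Python analogue of the TypeScript `waitForReady`. This is a sketch under assumptions, not the PR's code: the function name and return value are hypothetical, and the defaults of 30 attempts at 500 ms are taken from the diff hunk quoted in the review.

```python
import time
import urllib.error
import urllib.request

# Local app servers should not be reached through a system proxy.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_ready(url: str, max_attempts: int = 30, interval_s: float = 0.5) -> int:
    """Poll url until the server answers with any HTTP status; return that status."""
    for _ in range(max_attempts):
        try:
            with _opener.open(url, timeout=2) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            # A 4xx/5xx reply still proves the server is accepting connections.
            return err.code
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the server is not listening yet.
            time.sleep(interval_s)
    raise TimeoutError(f"server not ready after {max_attempts} attempts: {url}")
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would treat a 404 as "not ready", which is the defect the review describes.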

packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts (P1 readiness check)

Vulnerabilities

No security concerns identified. Verifier scripts are loaded from a trusted local path (WEBARENA_INFINITY_DIR) set via CI environment variable rather than from user-controlled input.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts | New class managing per-task Infinity server lifecycle; `waitForReady` only accepts 2xx responses which may false-timeout on servers that return 404 at root |
| packages/browseros-agent/apps/eval/src/graders/benchmark/agisdk-state-diff.ts | New AGI SDK grader navigates /finish, extracts state diff via MCP, then runs Python evaluator; message-scanning fallback in extractStartUrl is unreachable dead code |
| packages/browseros-agent/apps/eval/src/graders/benchmark/infinity-state.ts | New WebArena-Infinity grader resolving app URL and running Python verifier; uses Bun-specific import.meta.dir while sibling grader uses import.meta.dirname |
| packages/browseros-agent/apps/eval/scripts/infinity-evaluate.py | Python verifier runner that exits with code 1 on error while printing JSON to stdout, causing the TypeScript caller to throw before reading the detailed error message |
| packages/browseros-agent/apps/eval/scripts/agisdk-evaluate.py | Python AGI SDK evaluator bridge; correctly redirects stdout to stderr during evaluation and restores it on exception |
| packages/browseros-agent/apps/eval/src/runner/task-executor.ts | Extended to start/stop InfinityAppManager per task and thread infinityAppUrl to graders; workerIndex back-calculation from hardcoded constant noted in prior thread |
| .github/workflows/eval-weekly.yml | Adds Python deps and unconditionally clones webarena-infinity repo for every eval run regardless of config; adds timeout/continue-on-error to trend report step |
| packages/browseros-agent/apps/eval/scripts/build-infinity-dataset.py | Dataset builder incrementing port per app; field name app_port in generated output vs app_base_port read by executor already flagged in prior thread |
| packages/browseros-agent/apps/eval/scripts/build-agisdk-dataset.py | Reads agisdk task list, filters infeasible/llm-eval tasks, and emits correctly-formatted JSONL; straightforward and correct |
| packages/browseros-agent/apps/eval/src/graders/registry.ts | Registers two new deterministic graders and exports their classes; no issues |

Sequence Diagram

sequenceDiagram
    participant Runner as TaskRunner (Worker N)
    participant TE as TaskExecutor
    participant IAM as InfinityAppManager
    participant Browser as BrowserOS/MCP
    participant Agent as Agent
    participant PY as Python Verifier

    Runner->>TE: executeTask(task)
    TE->>IAM: startApp(appName)
    IAM->>IAM: spawn python3 server.py --port (base+N)
    IAM->>IAM: waitForReady (polls HTTP)
    IAM-->>TE: appUrl (http://localhost:PORT)

    TE->>Browser: navigate_page(appUrl)
    TE->>Agent: run(task, pageId)
    Agent-->>TE: AgentResult

    alt AGI SDK task
        TE->>Browser: navigate_page(/finish)
        Browser-->>TE: pre JSON (env_state)
        TE->>PY: agisdk-evaluate.py (stdin JSON)
        PY-->>TE: {reward, pass, per_criterion}
    else Infinity task
        TE->>PY: infinity-evaluate.py (app_server_url, verifier_path)
        PY->>IAM: GET /api/state
        IAM-->>PY: app state
        PY-->>TE: {pass, reward, message}
    end

    TE->>IAM: stop()
    IAM->>IAM: SIGTERM then SIGKILL after 3s
    TE-->>Runner: TaskResult
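The shutdown step at the end of the diagram (SIGTERM, then SIGKILL after 3 s) follows a common graceful-stop pattern. A minimal Python sketch of that pattern is shown here for illustration; the actual manager is TypeScript, and the function name and the sleeping child process are stand-ins.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_s: float = 3.0) -> int:
    """Ask the process to exit, then force-kill it if it outlives the grace period."""
    if proc.poll() is None:
        proc.terminate()  # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX
            proc.wait()
    return proc.returncode


# Stand-in for the spawned app server: a child that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child)
```

Waiting after `kill()` reaps the child so the port it held is released before the next task starts on the same worker.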

Prompt To Fix All With AI

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts
Line: 72

Comment:
**`waitForReady` rejects non-2xx HTTP responses as "not ready"**

`resp.ok` is `true` only for 2xx status codes. Many Python web frameworks (Flask, FastAPI) return `404` at `/` when no root route is defined, even after the server is fully started. This will cause `waitForReady` to spin through all 30 attempts and throw "server not ready" despite the server being alive — failing every Infinity task.

The check should accept any HTTP response (i.e., treat a network error as "not ready" but any HTTP status as "ready"):

```suggestion
        if (resp.status > 0) return
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/scripts/infinity-evaluate.py
Line: 64-69

Comment:
**Verifier error detail lost on non-zero exit code**

When an exception is caught here, the script prints a JSON error object to stdout, then calls `sys.exit(1)`. However, `infinity-state.ts` throws immediately on a non-zero exit code without reading stdout, so the full verifier traceback is silently discarded. Consider writing the traceback to stderr instead:

```suggestion
    except Exception as e:
        sys.stderr.write(f"Verifier error: {e}\n{traceback.format_exc()}")
        print(json.dumps({
            "pass": False,
            "reward": 0.0,
            "message": f"Verifier error: {e}",
        }))
        sys.exit(1)
```

How can I resolve this? If you propose a fix, please make it concise.
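A complementary option on the caller side is to read the verifier's stdout before acting on the exit code. The sketch below shows that idea in Python for consistency with the scripts; the real caller is `infinity-state.ts`, and `run_verifier` and the fake verifier command are hypothetical.

```python
import json
import subprocess
import sys


def run_verifier(cmd: list[str]) -> dict:
    """Run a verifier command and prefer its JSON stdout over the bare exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    try:
        result = json.loads(proc.stdout)
    except json.JSONDecodeError:
        result = None
    if isinstance(result, dict):
        return result
    # No usable JSON on stdout: fall back to stderr so the traceback is kept.
    detail = proc.stderr.strip() or f"exit code {proc.returncode}"
    return {"pass": False, "reward": 0.0, "message": f"Verifier error: {detail}"}


# Stand-in verifier that reports a failure as JSON and exits with code 1.
fake = (
    "import json, sys; "
    'print(json.dumps({"pass": False, "reward": 0.0, "message": "state mismatch"})); '
    "sys.exit(1)"
)
outcome = run_verifier([sys.executable, "-c", fake])
print(outcome["message"])
```

With this ordering, a structured failure message survives a non-zero exit, and a crash without JSON still surfaces whatever the script wrote to stderr.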

---

This is a comment left during a code review.
Path: .github/workflows/eval-weekly.yml
Line: 46-48

Comment:
**Unconditional webarena-infinity clone for all eval runs**

The repo is cloned on every weekly eval run, even when `EVAL_CONFIG` points to a non-infinity config (e.g. `browseros-agent-weekly.json`). The `--depth 1` helps, but it's still an unnecessary network fetch for the common case. Consider guarding it behind an input flag or caching with `actions/cache` using the commit SHA as the key.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/graders/benchmark/agisdk-state-diff.ts
Line: 95-106

Comment:
**Unreachable message-scanning fallback is dead code**

The `for` loop that scans `input.messages` for Vercel URLs is never reached for valid AGI SDK tasks: `siteId` is always truthy (it's the task_id after stripping the trailing `-{number}`), so the function always returns early. `input.task.start_url` already carries the correct URL and could be used directly:

```suggestion
  private extractStartUrl(input: GraderInput): string | null {
    return input.task.start_url ?? null
  }
```

**Rule Used:** Remove unused/dead code rather than leaving it in ... ([source](https://app.greptile.com/review/custom-context?memory=9b045db4-2630-428c-95b7-ccf048d34547))

**Learnt From**
[browseros-ai/BrowserOS-agent#126](https://github.com/browseros-ai/BrowserOS-agent/pull/126)

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "ci: add timeout and continue-on-error fo..."

greptile-apps · 2026-04-09T06:54:49Z

+            "additional": {
+                "app_name": app_name,
+                "difficulty": difficulty,
+                "verifier_path": verifier_path,


Field key mismatch: app_port vs app_base_port

The dataset builder emits "app_port" but task-executor.ts reads app_base_port (line ~120: ?.app_base_port as number). The committed .jsonl file uses app_base_port so it works today, but any regeneration of the dataset via this script will silently produce the wrong key, causing the executor to fall back to port 8000 for all workers — leading to port conflicts when running 10 parallel tasks.

Suggested change

"verifier_path": verifier_path,

"app_base_port": base_port,
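One way to keep the writer and reader from drifting apart is to share the key name. The sketch below illustrates the failure mode and the fix in a single Python file, assuming a simplified record shape; in the PR the reader is TypeScript, so the shared constant is only illustrative, and the app name, verifier path, and port values are made up. The fallback of 8000 is the one mentioned in the comment.

```python
import json

PORT_KEY = "app_base_port"  # single name used by both writer and reader
DEFAULT_PORT = 8000  # fallback port mentioned in the review comment


def build_record(app_name: str, difficulty: str, verifier_path: str, base_port: int) -> dict:
    """Writer side: emit the port under the key the reader expects."""
    return {
        "additional": {
            "app_name": app_name,
            "difficulty": difficulty,
            "verifier_path": verifier_path,
            PORT_KEY: base_port,
        }
    }


def read_base_port(record: dict) -> int:
    """Reader side: a missing key silently falls back to the default port."""
    return record.get("additional", {}).get(PORT_KEY, DEFAULT_PORT)


line = json.dumps(build_record("gmail", "hard", "verifiers/example.py", 8100))
print(read_base_port(json.loads(line)))  # 8100

# A record written under the mismatched key falls back without any error.
print(read_base_port({"additional": {"app_port": 8100}}))  # 8000
```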

Prompt To Fix With AI

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/scripts/build-infinity-dataset.py
Line: 55

Comment:
**Field key mismatch: `app_port` vs `app_base_port`**

The dataset builder emits `"app_port"` but `task-executor.ts` reads `app_base_port` (line ~120: `?.app_base_port as number`). The committed `.jsonl` file uses `app_base_port` so it works today, but any regeneration of the dataset via this script will silently produce the wrong key, causing the executor to fall back to port `8000` for all workers, which leads to port conflicts when running 10 parallel tasks.

```suggestion
"app_base_port": base_port,
```

How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-09T06:54:50Z

+      const appName = (task.metadata?.additional as Record<string, unknown>)
+        ?.app_name as string
+      const appBasePort =
+        ((task.metadata?.additional as Record<string, unknown>)


Hardcoded magic base port for worker-index derivation

this.config.browseros.base_server_port - 9110 assumes the global base is always 9110. If the base port is ever changed in config (e.g. to accommodate a different environment), all Infinity app instances will compute the wrong workerIndex, mapping every worker to the same app-server port and causing bind failures at scale. The worker index should be threaded through explicitly from the runner rather than back-calculated from a hardcoded constant.
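The arithmetic behind this concern can be made concrete with a small sketch. It assumes each worker's server port is the global base plus its index, which is an inference from the comment rather than something the PR states; all function names are hypothetical.

```python
ASSUMED_BASE = 9110  # the hardcoded constant from the review comment


def worker_server_port(global_base: int, worker_index: int) -> int:
    """Assumed layout: a worker's server port is the global base plus its index."""
    return global_base + worker_index


def index_from_port(configured_port: int) -> int:
    """Back-calculate the worker index by subtracting the hardcoded base."""
    return configured_port - ASSUMED_BASE


def app_port(app_base_port: int, worker_index: int) -> int:
    """Give each worker its own app-server port, using an explicit index."""
    return app_base_port + worker_index


# With the default base, the back-calculation recovers the index.
print(index_from_port(worker_server_port(9110, 2)))  # 2
# With a different base, it yields an unrelated number.
print(index_from_port(worker_server_port(9200, 2)))  # 92
# Passing the index explicitly does not depend on the base at all.
print(app_port(8100, 2))  # 8102
```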

Prompt To Fix With AI

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/runner/task-executor.ts
Line: 113

Comment:
**Hardcoded magic base port for worker-index derivation**

`this.config.browseros.base_server_port - 9110` assumes the global base is always `9110`. If the base port is ever changed in config (e.g. to accommodate a different environment), all Infinity app instances will compute the wrong `workerIndex`, mapping every worker to the same app-server port and causing bind failures at scale. The worker index should be threaded through explicitly from the runner rather than back-calculated from a hardcoded constant.

How can I resolve this? If you propose a fix, please make it concise.


shivammittal274 · 2026-04-09T18:03:38Z

@greptile-ai review

greptile-apps · 2026-04-09T18:08:58Z

+  private async waitForReady(
+    url: string,
+    maxAttempts = 30,
+    intervalMs = 500,


waitForReady rejects non-2xx HTTP responses as "not ready"

resp.ok is true only for 2xx status codes. Many Python web frameworks (Flask, FastAPI) return 404 at / when no root route is defined, even after the server is fully started. This will cause waitForReady to spin through all 30 attempts and throw "server not ready" despite the server being alive — failing every Infinity task.

The check should accept any HTTP response (i.e., treat a network error as "not ready" but any HTTP status as "ready"):

Suggested change

intervalMs = 500,

if (resp.status > 0) return


github-actions Bot added the Feature label Apr 9, 2026

chore: switch eval configs back to kimi-k2p5

3e3cff5

greptile-apps Bot reviewed Apr 9, 2026


shivammittal274 added 4 commits April 9, 2026 13:23

fix: register deterministic graders in pass rate calculation

053c773

Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER in both runner types and weekly report script, so scores show correctly in the dashboard.

chore: temp switch to opus 4.6 for eval run

16ce885

chore: restore kimi-k2p5 as default eval config

ef7a022

ci: add timeout and continue-on-error for trend report step

e310a5d

greptile-apps Bot reviewed Apr 9, 2026


shivammittal274 wants to merge 6 commits into main from feat/deterministic-eval-graders-clean
