feat: deterministic eval graders (AGI SDK + WebArena-Infinity) #664
shivammittal274 wants to merge 6 commits into main
Conversation
Two new benchmark integrations with programmatic grading — no LLM judge.

AGI SDK / REAL Bench (52 tasks):
- 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.)
- Grader navigates the browser to /finish, extracts the state diff from a `<pre>` tag
- Python verifier checks exact values via jmespath queries

WebArena-Infinity (50 hard tasks):
- 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.)
- InfinityAppManager starts a fresh app server per task per worker
- Python verifier calls /api/state and asserts on JSON state

Infrastructure:
- GraderInput extended with mcpUrl + infinityAppUrl for parallel workers
- Each worker gets isolated ports (no cross-worker state contamination)
- CI workflow: pip install agisdk, clone webarena-infinity repo
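The programmatic-grading idea above (compare the app's final JSON state against expected values, no LLM judge) can be sketched as follows. This is a dependency-free stand-in: a tiny dot-path lookup plays the role of `jmespath.search`, and the sample state and check names are illustrative, not the PR's actual verifier code.

```python
# Sketch of a deterministic grader check: compare the app's final JSON
# state against expected exact values. The real verifiers use jmespath;
# a minimal dot-path lookup stands in here so the sketch has no deps.
# All names and the sample state are illustrative.

def lookup(state: dict, path: str):
    """Resolve a dotted path like 'cart.items[0].name'."""
    cur = state
    for part in path.split("."):
        if "[" in part:  # e.g. items[0]
            key, idx = part[:-1].split("[")
            cur = cur[key][int(idx)]
        else:
            cur = cur[part]
    return cur

def grade(state: dict, checks: dict) -> dict:
    """Return pass/fail plus a per-criterion breakdown, as the verifiers do."""
    per_criterion = {p: lookup(state, p) == want for p, want in checks.items()}
    passed = all(per_criterion.values())
    return {"pass": passed,
            "reward": 1.0 if passed else 0.0,
            "per_criterion": per_criterion}

state = {"cart": {"items": [{"name": "Burrito", "qty": 2}], "total": 17.98}}
checks = {"cart.items[0].name": "Burrito", "cart.total": 17.98}
result = grade(state, checks)  # → {"pass": True, "reward": 1.0, ...}
```

Because every criterion is an exact-value comparison, the grade is fully reproducible across runs — the property the PR is after.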
Greptile Summary

Adds two deterministic benchmark integrations (AGI SDK / REAL Bench and WebArena-Infinity) with programmatic graders, per-task app-server lifecycle management, and CI wiring. The implementation is well-structured with clear separation between the TypeScript runner and Python verifiers.
Confidence Score: 4/5

Mostly safe — one P1 in the server-readiness check could silently fail all Infinity tasks; the fix is a one-liner before merging. The P1 `waitForReady` issue (`resp.ok` rejects non-2xx) is a genuine runtime defect that would cause InfinityAppManager to time out on every task if the Python server doesn't serve a 200 at root. The two P2s are non-blocking quality issues.

Most important file: packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts (P1 readiness check)
| Filename | Overview |
|---|---|
| packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts | New class managing per-task Infinity server lifecycle; waitForReady only accepts 2xx responses which may false-timeout on servers that return 404 at root |
| packages/browseros-agent/apps/eval/src/graders/benchmark/agisdk-state-diff.ts | New AGI SDK grader navigates /finish, extracts state diff via MCP, then runs Python evaluator; message-scanning fallback in extractStartUrl is unreachable dead code |
| packages/browseros-agent/apps/eval/src/graders/benchmark/infinity-state.ts | New WebArena-Infinity grader resolving app URL and running Python verifier; uses Bun-specific import.meta.dir while sibling grader uses import.meta.dirname |
| packages/browseros-agent/apps/eval/scripts/infinity-evaluate.py | Python verifier runner that exits with code 1 on error while printing JSON to stdout, causing the TypeScript caller to throw before reading the detailed error message |
| packages/browseros-agent/apps/eval/scripts/agisdk-evaluate.py | Python AGI SDK evaluator bridge; correctly redirects stdout to stderr during evaluation and restores it on exception |
| packages/browseros-agent/apps/eval/src/runner/task-executor.ts | Extended to start/stop InfinityAppManager per task and thread infinityAppUrl to graders; workerIndex back-calculation from hardcoded constant noted in prior thread |
| .github/workflows/eval-weekly.yml | Adds Python deps and unconditionally clones webarena-infinity repo for every eval run regardless of config; adds timeout/continue-on-error to trend report step |
| packages/browseros-agent/apps/eval/scripts/build-infinity-dataset.py | Dataset builder incrementing port per app; field name app_port in generated output vs app_base_port read by executor already flagged in prior thread |
| packages/browseros-agent/apps/eval/scripts/build-agisdk-dataset.py | Reads agisdk task list, filters infeasible/llm-eval tasks, and emits correctly-formatted JSONL; straightforward and correct |
| packages/browseros-agent/apps/eval/src/graders/registry.ts | Registers two new deterministic graders and exports their classes; no issues |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Runner as TaskRunner (Worker N)
    participant TE as TaskExecutor
    participant IAM as InfinityAppManager
    participant Browser as BrowserOS/MCP
    participant Agent as Agent
    participant PY as Python Verifier
    Runner->>TE: executeTask(task)
    TE->>IAM: startApp(appName)
    IAM->>IAM: spawn python3 server.py --port (base+N)
    IAM->>IAM: waitForReady (polls HTTP)
    IAM-->>TE: appUrl (http://localhost:PORT)
    TE->>Browser: navigate_page(appUrl)
    TE->>Agent: run(task, pageId)
    Agent-->>TE: AgentResult
    alt AGI SDK task
        TE->>Browser: navigate_page(/finish)
        Browser-->>TE: pre JSON (env_state)
        TE->>PY: agisdk-evaluate.py (stdin JSON)
        PY-->>TE: {reward, pass, per_criterion}
    else Infinity task
        TE->>PY: infinity-evaluate.py (app_server_url, verifier_path)
        PY->>IAM: GET /api/state
        IAM-->>PY: app state
        PY-->>TE: {pass, reward, message}
    end
    TE->>IAM: stop()
    IAM->>IAM: SIGTERM then SIGKILL after 3s
    TE-->>Runner: TaskResult
```
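The shutdown step in the diagram (SIGTERM, then SIGKILL after a 3-second grace period) can be sketched as below. The actual InfinityAppManager is TypeScript; this is an illustrative Python equivalent using a plain subprocess, not the PR's code.

```python
# Sketch of stop(): send SIGTERM first so the app server can clean up,
# escalate to SIGKILL if it hasn't exited within the grace period.
import subprocess
import sys

def stop(proc: subprocess.Popen, grace_s: float = 3.0) -> int:
    proc.terminate()                      # SIGTERM: polite shutdown request
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL: force-stop after grace
        return proc.wait()

# Example child: a long-running process that exits promptly on SIGTERM.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
code = stop(proc)
```

The grace period matters for the eval loop: a server that dies mid-write could leave a port in TIME_WAIT or corrupt per-task state, so asking nicely first is the safer default.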
Path: packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts
Line: 72
Comment:
**`waitForReady` rejects non-2xx HTTP responses as "not ready"**
`resp.ok` is `true` only for 2xx status codes. Many Python web frameworks (Flask, FastAPI) return `404` at `/` when no root route is defined, even after the server is fully started. This will cause `waitForReady` to spin through all 30 attempts and throw "server not ready" despite the server being alive — failing every Infinity task.
The check should accept any HTTP response (i.e., treat a network error as "not ready" but any HTTP status as "ready"):
```suggestion
if (resp.status > 0) return
```
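The suggested semantics — a connection error means "not ready yet", while any HTTP status (including 404) means the server is up — can be sketched in stdlib Python. The real check lives in infinity-app-manager.ts; this single-probe helper is illustrative.

```python
# Sketch of the suggested readiness semantics: only transport-level
# failures count as "not ready"; any HTTP response means the server
# process is alive and bound to the port.
import urllib.error
import urllib.request

def probe_once(url: str, timeout_s: float = 1.0) -> bool:
    try:
        urllib.request.urlopen(url, timeout=timeout_s)
        return True                      # 2xx/3xx: trivially ready
    except urllib.error.HTTPError:
        return True                      # 404/500/etc.: server is alive
    except (urllib.error.URLError, OSError):
        return False                     # refused/timed out: keep polling
```

Note the ordering: `HTTPError` is a subclass of `URLError`, so it must be caught first or every 404 would be misclassified as "not ready" — the same bug `resp.ok` causes in the TypeScript version.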
---
Path: packages/browseros-agent/apps/eval/scripts/infinity-evaluate.py
Line: 64-69
Comment:
**Verifier error detail lost on non-zero exit code**
When an exception is caught here, the script prints a JSON error object to stdout, then calls `sys.exit(1)`. However, `infinity-state.ts` throws immediately on a non-zero exit code without reading stdout, so the full verifier traceback is silently discarded. Consider writing the traceback to stderr instead:
```suggestion
except Exception as e:
sys.stderr.write(f"Verifier error: {e}\n{traceback.format_exc()}")
print(json.dumps({
"pass": False,
"reward": 0.0,
"message": f"Verifier error: {e}",
}))
sys.exit(1)
```
---
Path: .github/workflows/eval-weekly.yml
Line: 46-48
Comment:
**Unconditional webarena-infinity clone for all eval runs**
The repo is cloned on every weekly eval run, even when `EVAL_CONFIG` points to a non-infinity config (e.g. `browseros-agent-weekly.json`). The `--depth 1` helps, but it's still an unnecessary network fetch for the common case. Consider guarding it behind an input flag or caching with `actions/cache` using the commit SHA as the key.
---
Path: packages/browseros-agent/apps/eval/src/graders/benchmark/agisdk-state-diff.ts
Line: 95-106
Comment:
**Unreachable message-scanning fallback is dead code**
The `for` loop that scans `input.messages` for Vercel URLs is never reached for valid AGI SDK tasks: `siteId` is always truthy (it's the task_id after stripping the trailing `-{number}`), so the function always returns early. `input.task.start_url` already carries the correct URL and could be used directly:
```suggestion
private extractStartUrl(input: GraderInput): string | null {
return input.task.start_url ?? null
}
```
**Rule Used:** Remove unused/dead code rather than leaving it in ... ([source](https://app.greptile.com/review/custom-context?memory=9b045db4-2630-428c-95b7-ccf048d34547))
**Learnt From**
[browseros-ai/BrowserOS-agent#126](https://github.com/browseros-ai/BrowserOS-agent/pull/126)
Reviews (2): Last reviewed commit: "ci: add timeout and continue-on-error fo..."
---
Path: packages/browseros-agent/apps/eval/scripts/build-infinity-dataset.py
Line: 55
Comment:
**Field key mismatch: `app_port` vs `app_base_port`**
The dataset builder emits `"app_port"` but `task-executor.ts` reads `app_base_port` (line ~120: `?.app_base_port as number`). The committed `.jsonl` file uses `app_base_port` so it works today, but any regeneration of the dataset via this script will silently produce the wrong key, causing the executor to fall back to port `8000` for all workers — leading to port conflicts when running 10 parallel tasks.
```suggestion
"app_base_port": base_port,
```
---
Path: packages/browseros-agent/apps/eval/src/runner/task-executor.ts
Line: 113
Comment:
**Hardcoded magic base port for worker-index derivation**
`this.config.browseros.base_server_port - 9110` assumes the global base is always `9110`. If the base port is ever changed in config (e.g. to accommodate a different environment), all Infinity app instances will compute the wrong `workerIndex`, mapping every worker to the same app-server port and causing bind failures at scale. The worker index should be threaded through explicitly from the runner rather than back-calculated from a hardcoded constant.
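The suggested fix amounts to making the worker index an explicit input rather than a value reverse-engineered from ports. A sketch of the intended arithmetic (names are illustrative, not the actual task-executor.ts API):

```python
# Sketch: thread worker_index explicitly from the runner instead of
# back-calculating it as (base_server_port - 9110). No magic constants.
def infinity_port(app_base_port: int, worker_index: int) -> int:
    """Each worker gets its own app-server port, offset from the app's base."""
    return app_base_port + worker_index

# 10 parallel workers on one app never collide, regardless of what the
# global browser base port happens to be configured as.
ports = [infinity_port(8000, i) for i in range(10)]
assert len(set(ports)) == len(ports)
```

With the index passed down from the runner, changing `base_server_port` in config can no longer silently map every worker onto the same app-server port.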
---
Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER in both runner types and the weekly report script, so scores show correctly in the dashboard.
@greptile-ai review
Summary

- AGI SDK: grader navigates to the /finish page via MCP; Python verifier checks exact values via jmespath.
- WebArena-Infinity: per-task app servers managed by InfinityAppManager; Python verifier checks /api/state.
- CI: pip install agisdk, clone webarena-infinity, OPENROUTER_API_KEY secret.

Initial Results (Kimi K2.5)
Test plan
🤖 Generated with Claude Code