feat: deterministic eval graders (AGI SDK + WebArena-Infinity) #664

Open
shivammittal274 wants to merge 6 commits into main from feat/deterministic-eval-graders-clean

@shivammittal274
Contributor

Summary

  • Two new benchmark integrations with programmatic grading — no LLM judge
  • AGI SDK / REAL Bench: 52 tasks across 11 consumer app clones (DoorDash, Amazon, Gmail, Airbnb, etc.). Grader fetches state diff from /finish page via MCP, Python verifier checks exact values via jmespath.
  • WebArena-Infinity: 50 hard tasks across 13 SaaS app clones (Gmail, GitLab, Linear, Figma, etc.). Fresh app server started per task per worker via InfinityAppManager. Python verifier checks /api/state.
  • Both support 10 parallel workers with isolated ports
  • CI workflow updated: pip install agisdk, clone webarena-infinity, OPENROUTER_API_KEY secret

Initial Results (Kimi K2.5)

| Benchmark | Tasks | Pass Rate |
|---|---|---|
| AGI SDK / REAL | 52 | 30.8% |
| WebArena-Infinity Hard | 50 | 42.0% |

Test plan

  • AGI SDK: 3-task local smoke test — 3/3 passed
  • AGI SDK: 52-task CI run — 30.8% pass rate, grading deterministic
  • Infinity: 3-task local smoke test — 2/3 passed (1 legit agent failure)
  • Infinity: 50-task CI run — 42% pass rate, grading deterministic
  • Run with Claude Opus 4.6 via OpenRouter

🤖 Generated with Claude Code

Two new benchmark integrations with programmatic grading — no LLM judge.

AGI SDK / REAL Bench (52 tasks):
- 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.)
- Grader navigates browser to /finish, extracts state diff from <pre> tag
- Python verifier checks exact values via jmespath queries

WebArena-Infinity (50 hard tasks):
- 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.)
- InfinityAppManager starts fresh app server per task per worker
- Python verifier calls /api/state and asserts on JSON state

Infrastructure:
- GraderInput extended with mcpUrl + infinityAppUrl for parallel workers
- Each worker gets isolated ports (no cross-worker state contamination)
- CI workflow: pip install agisdk, clone webarena-infinity repo
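
To illustrate what a deterministic state check of this kind can look like, here is a minimal sketch. The state schema (`issues`, `title`, `status`) and the function name `verify_issue_closed` are hypothetical and not taken from the WebArena-Infinity repo; only the `{pass, reward, message}` result shape follows this PR.

```python
from typing import Any


def verify_issue_closed(state: dict[str, Any], title: str) -> dict[str, Any]:
    """Check a fetched app state for one exact condition (hypothetical schema)."""
    # In the real flow the state would come from GET <app_url>/api/state;
    # here it is passed in so the check stays a pure function.
    matches = [i for i in state.get("issues", []) if i.get("title") == title]
    if not matches:
        return {"pass": False, "reward": 0.0, "message": f"issue {title!r} not found"}
    status = matches[0].get("status")
    if status != "closed":
        return {"pass": False, "reward": 0.0, "message": f"expected closed, got {status!r}"}
    return {"pass": True, "reward": 1.0, "message": "ok"}


example_state = {"issues": [{"title": "Fix login", "status": "closed"}]}
result = verify_issue_closed(example_state, "Fix login")
```

Because the check is a pure function of the state, the same input always yields the same grade, which is the property the PR relies on in place of an LLM judge.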
@greptile-apps
Contributor

greptile-apps Bot commented Apr 9, 2026

Greptile Summary

Adds two deterministic benchmark integrations (AGI SDK / REAL Bench and WebArena-Infinity) with programmatic graders, per-task app-server lifecycle management, and CI wiring. The implementation is well-structured with clear separation between the TypeScript runner and Python verifiers.

  • P1 – infinity-app-manager.ts: waitForReady uses resp.ok (2xx only); Python servers that return 404 at / will be treated as "not ready," timing out every Infinity task before the agent even starts.

Confidence Score: 4/5

Mostly safe — one P1 in the server-readiness check could silently fail all Infinity tasks; fix is a one-liner before merging.

The P1 waitForReady issue (resp.ok rejects non-2xx) is a genuine runtime defect that would cause InfinityAppManager to time out on every task if the Python server doesn't serve 200 at root. The two P2s are non-blocking quality issues.

packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts (P1 readiness check)
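
The readiness rule described above (any HTTP status means "up", only a network error means "not ready") can be sketched as follows. This is a Python analogue written for illustration, not the TypeScript code in `infinity-app-manager.ts`; the name `wait_for_ready` and its defaults are assumptions modelled on the 30 attempts and 500 ms interval mentioned in the review.

```python
import time
import urllib.error
import urllib.request


def wait_for_ready(url: str, max_attempts: int = 30, interval_s: float = 0.5) -> bool:
    """Return True once the server answers with any HTTP status, else False."""
    for _ in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=2):
                return True  # 2xx/3xx: server is up
        except urllib.error.HTTPError:
            return True  # 4xx/5xx still proves the server is listening
        except (urllib.error.URLError, OSError):
            time.sleep(interval_s)  # connection refused or timed out: retry
    return False
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would treat a 404 as a connection failure, which is the behaviour the review flags.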

Vulnerabilities

No security concerns identified. Verifier scripts are loaded from a trusted local path (WEBARENA_INFINITY_DIR) set via CI environment variable rather than from user-controlled input.

Important Files Changed

| Filename | Overview |
|---|---|
| `packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts` | New class managing per-task Infinity server lifecycle; `waitForReady` only accepts 2xx responses, which may false-timeout on servers that return 404 at root |
| `packages/browseros-agent/apps/eval/src/graders/benchmark/agisdk-state-diff.ts` | New AGI SDK grader navigates to /finish, extracts state diff via MCP, then runs Python evaluator; message-scanning fallback in `extractStartUrl` is unreachable dead code |
| `packages/browseros-agent/apps/eval/src/graders/benchmark/infinity-state.ts` | New WebArena-Infinity grader resolving app URL and running Python verifier; uses Bun-specific `import.meta.dir` while sibling grader uses `import.meta.dirname` |
| `packages/browseros-agent/apps/eval/scripts/infinity-evaluate.py` | Python verifier runner that exits with code 1 on error while printing JSON to stdout, causing the TypeScript caller to throw before reading the detailed error message |
| `packages/browseros-agent/apps/eval/scripts/agisdk-evaluate.py` | Python AGI SDK evaluator bridge; correctly redirects stdout to stderr during evaluation and restores it on exception |
| `packages/browseros-agent/apps/eval/src/runner/task-executor.ts` | Extended to start/stop InfinityAppManager per task and thread `infinityAppUrl` to graders; `workerIndex` back-calculation from hardcoded constant noted in prior thread |
| `.github/workflows/eval-weekly.yml` | Adds Python deps and unconditionally clones webarena-infinity repo for every eval run regardless of config; adds timeout/continue-on-error to trend report step |
| `packages/browseros-agent/apps/eval/scripts/build-infinity-dataset.py` | Dataset builder incrementing port per app; field name `app_port` in generated output vs `app_base_port` read by executor already flagged in prior thread |
| `packages/browseros-agent/apps/eval/scripts/build-agisdk-dataset.py` | Reads agisdk task list, filters infeasible/llm-eval tasks, and emits correctly-formatted JSONL; straightforward and correct |
| `packages/browseros-agent/apps/eval/src/graders/registry.ts` | Registers two new deterministic graders and exports their classes; no issues |

Sequence Diagram

sequenceDiagram
    participant Runner as TaskRunner (Worker N)
    participant TE as TaskExecutor
    participant IAM as InfinityAppManager
    participant Browser as BrowserOS/MCP
    participant Agent as Agent
    participant PY as Python Verifier

    Runner->>TE: executeTask(task)
    TE->>IAM: startApp(appName)
    IAM->>IAM: spawn python3 server.py --port (base+N)
    IAM->>IAM: waitForReady (polls HTTP)
    IAM-->>TE: appUrl (http://localhost:PORT)

    TE->>Browser: navigate_page(appUrl)
    TE->>Agent: run(task, pageId)
    Agent-->>TE: AgentResult

    alt AGI SDK task
        TE->>Browser: navigate_page(/finish)
        Browser-->>TE: pre JSON (env_state)
        TE->>PY: agisdk-evaluate.py (stdin JSON)
        PY-->>TE: {reward, pass, per_criterion}
    else Infinity task
        TE->>PY: infinity-evaluate.py (app_server_url, verifier_path)
        PY->>IAM: GET /api/state
        IAM-->>PY: app state
        PY-->>TE: {pass, reward, message}
    end

    TE->>IAM: stop()
    IAM->>IAM: SIGTERM then SIGKILL after 3s
    TE-->>Runner: TaskResult
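
The last step of the diagram, SIGTERM followed by SIGKILL after 3 seconds, is a common graceful-shutdown pattern. A minimal Python sketch of it is shown here; the helper name `stop_process` is hypothetical, and the sleeping child process merely stands in for an app server.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_s: float = 3.0) -> int:
    """Ask the process to exit, then force-kill it if it outlives the grace period."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


# Stand-in for the app server: a child process that would otherwise sleep for a minute.
server = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(server)
```

Waiting after `kill()` reaps the child, so repeated per-task start/stop cycles do not accumulate zombie processes.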
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts
Line: 72

Comment:
**`waitForReady` rejects non-2xx HTTP responses as "not ready"**

`resp.ok` is `true` only for 2xx status codes. Many Python web frameworks (Flask, FastAPI) return `404` at `/` when no root route is defined, even after the server is fully started. This will cause `waitForReady` to spin through all 30 attempts and throw "server not ready" despite the server being alive — failing every Infinity task.

The check should accept any HTTP response (i.e., treat a network error as "not ready" but any HTTP status as "ready"):

```suggestion
        if (resp.status > 0) return
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/scripts/infinity-evaluate.py
Line: 64-69

Comment:
**Verifier error detail lost on non-zero exit code**

When an exception is caught here, the script prints a JSON error object to stdout, then calls `sys.exit(1)`. However, `infinity-state.ts` throws immediately on a non-zero exit code without reading stdout, so the full verifier traceback is silently discarded. Consider writing the traceback to stderr instead:

```suggestion
    except Exception as e:
        sys.stderr.write(f"Verifier error: {e}\n{traceback.format_exc()}")
        print(json.dumps({
            "pass": False,
            "reward": 0.0,
            "message": f"Verifier error: {e}",
        }))
        sys.exit(1)
```

How can I resolve this? If you propose a fix, please make it concise.
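
A complementary option on the caller side is to parse stdout before acting on the exit code, so the structured error is never dropped. The sketch below is a Python analogue for illustration only: the real caller is `infinity-state.ts`, and `run_verifier` and the inline stand-in verifier are hypothetical.

```python
import json
import subprocess
import sys


def run_verifier(cmd: list[str]) -> dict:
    """Run a verifier command and prefer its JSON result over the bare exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    try:
        # Parse stdout first, so a structured error survives a non-zero exit.
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        detail = proc.stderr.strip() or f"exit code {proc.returncode}"
        return {"pass": False, "reward": 0.0, "message": f"Verifier crashed: {detail}"}


# Stand-in verifier that reports a failure as JSON and then exits with code 1.
fake_verifier = (
    "import json, sys; "
    "print(json.dumps({'pass': False, 'reward': 0.0, 'message': 'Verifier error: boom'})); "
    "sys.exit(1)"
)
outcome = run_verifier([sys.executable, "-c", fake_verifier])
```

With this ordering, a verifier that fails cleanly still surfaces its own message, and only a run that prints no valid JSON falls back to stderr or the exit code.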

---

This is a comment left during a code review.
Path: .github/workflows/eval-weekly.yml
Line: 46-48

Comment:
**Unconditional webarena-infinity clone for all eval runs**

The repo is cloned on every weekly eval run, even when `EVAL_CONFIG` points to a non-infinity config (e.g. `browseros-agent-weekly.json`). The `--depth 1` helps, but it's still an unnecessary network fetch for the common case. Consider guarding it behind an input flag or caching with `actions/cache` using the commit SHA as the key.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/graders/benchmark/agisdk-state-diff.ts
Line: 95-106

Comment:
**Unreachable message-scanning fallback is dead code**

The `for` loop that scans `input.messages` for Vercel URLs is never reached for valid AGI SDK tasks: `siteId` is always truthy (it's the task_id after stripping the trailing `-{number}`), so the function always returns early. `input.task.start_url` already carries the correct URL and could be used directly:

```suggestion
  private extractStartUrl(input: GraderInput): string | null {
    return input.task.start_url ?? null
  }
```

**Rule Used:** Remove unused/dead code rather than leaving it in ... ([source](https://app.greptile.com/review/custom-context?memory=9b045db4-2630-428c-95b7-ccf048d34547))

**Learnt From**
[browseros-ai/BrowserOS-agent#126](https://github.com/browseros-ai/BrowserOS-agent/pull/126)

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "ci: add timeout and continue-on-error fo..."

"additional": {
"app_name": app_name,
"difficulty": difficulty,
"verifier_path": verifier_path,

P1 Field key mismatch: `app_port` vs `app_base_port`

Path: packages/browseros-agent/apps/eval/scripts/build-infinity-dataset.py
Line: 55

The dataset builder emits `"app_port"` but `task-executor.ts` reads `app_base_port` (line ~120: `?.app_base_port as number`). The committed `.jsonl` file uses `app_base_port` so it works today, but any regeneration of the dataset via this script will silently produce the wrong key, causing the executor to fall back to port `8000` for all workers, which leads to port conflicts when running 10 parallel tasks.

```suggestion
                    "app_base_port": base_port,
```
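
One way to keep a key mismatch like this from passing silently is to make the reader fail loudly when the key is absent instead of defaulting. A minimal sketch follows, assuming a hypothetical helper `read_app_base_port`; the record values are illustrative and not taken from the committed dataset.

```python
import json


def read_app_base_port(additional: dict) -> int:
    """Return the base port, failing loudly instead of silently defaulting."""
    if "app_base_port" not in additional:
        raise KeyError(f"missing 'app_base_port'; keys present: {sorted(additional)}")
    return int(additional["app_base_port"])


# One dataset record as the builder would serialise it (values are illustrative).
line = json.dumps({"additional": {"app_name": "gmail", "app_base_port": 8000}})
port = read_app_base_port(json.loads(line)["additional"])
```

A record written with the old `app_port` key would then raise at load time rather than surfacing later as a port conflict between workers.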

const appName = (task.metadata?.additional as Record<string, unknown>)
?.app_name as string
const appBasePort =
((task.metadata?.additional as Record<string, unknown>)

P1 Hardcoded magic base port for worker-index derivation

Path: packages/browseros-agent/apps/eval/src/runner/task-executor.ts
Line: 113

`this.config.browseros.base_server_port - 9110` assumes the global base is always `9110`. If the base port is ever changed in config (e.g. to accommodate a different environment), all Infinity app instances will compute the wrong `workerIndex`, mapping every worker to the same app-server port and causing bind failures at scale. The worker index should be threaded through explicitly from the runner rather than back-calculated from a hardcoded constant.
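
Threading the index through explicitly could look roughly like the sketch below. It is written in Python for illustration; `WorkerContext` is a hypothetical type, not one from this PR, and the base port `8000` is only an example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerContext:
    """Per-worker values handed to the executor by the runner (hypothetical type)."""
    worker_index: int
    app_base_port: int

    @property
    def app_port(self) -> int:
        # Mirrors the "base + N" scheme in the sequence diagram.
        return self.app_base_port + self.worker_index


# Ten workers sharing one app base port end up on ten distinct ports.
ports = [WorkerContext(worker_index=i, app_base_port=8000).app_port for i in range(10)]
```

Since the index is passed in rather than derived from another port, changing the BrowserOS base server port in config has no effect on which app-server port each worker binds.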

Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER
in both runner types and weekly report script, so scores show correctly
in the dashboard.
@shivammittal274
Contributor Author

@greptile-ai review

private async waitForReady(
url: string,
maxAttempts = 30,
intervalMs = 500,

P1 `waitForReady` rejects non-2xx HTTP responses as "not ready"

Path: packages/browseros-agent/apps/eval/src/runner/infinity-app-manager.ts
Line: 72

`resp.ok` is `true` only for 2xx status codes. Many Python web frameworks (Flask, FastAPI) return `404` at `/` when no root route is defined, even after the server is fully started. This will cause `waitForReady` to spin through all 30 attempts and throw "server not ready" despite the server being alive, failing every Infinity task.

The check should accept any HTTP response (i.e., treat a network error as "not ready" but any HTTP status as "ready"):

```suggestion
        if (resp.status > 0) return
```
