feat(agentos): Agent Simulation Engine — multi-judge evals + overview dashboard by patel-lyzr · Pull Request #22 · open-gitagent/ComputerAgent

patel-lyzr · 2026-06-05T07:23:44Z

Summary

Adds an Agent Simulation Engine to the AgentOS Console — an eval framework that grades registered agents against test suites, plus an observability-style overview dashboard.

Server (`packages/agentos-server`)

- eval-runner: runs each case against the harness /run, consumes the SSE stream server-side, and captures final output + full trace (thinking/text/tool calls/results) + policy denials + cost/latency. Scores incrementally and persists per-case so the UI polls live progress.
- 4 scorers (eval-scorers):
  - Golden match (exact / contains / regex)
  - Tool & policy compliance (expected/forbidden tools; a policy-denied tool counts as compliant)
  - NFR (cost / latency thresholds)
  - Trace-aware LLM-as-a-judge: judges the whole trace, not just final output. Multiple judges per suite, each scored independently against its own rubric (OpenAI score_model style: 0–1 score + pass threshold; template vars `{{prompt}}` `{{criteria}}` `{{output}}` `{{trace}}` `{{tools}}` `{{golden}}`). A case passes only when every enabled scorer + judge passes.
- eval-generate: synthesizes test cases from the agent's own identity files (agent.yaml/SOUL.md/RULES.md/…) plus a live tool probe.
- routes/evals: suite CRUD, run trigger, run readback, case generation. New eval_suites + eval_runs Mongo collections; router mounted under auth.
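
For readers who want a concrete picture of the three deterministic scorers and the all-must-pass rule, here is a minimal sketch. It is not the code from this PR: the field names (`toolsExecuted`, `toolsDenied`, `costUsd`, `latencyMs`), the function names, and the `ScoreResult` fields are assumptions made for illustration. It also assumes that "a policy-denied tool counts as compliant" means a forbidden tool only violates the rule when it actually executed.

```typescript
// Hypothetical shapes: the PR names ScoreResult but does not show its fields.
type GoldenMode = "exact" | "contains" | "regex";

interface ScoreResult {
  scorer: string;
  pass: boolean;
  reason: string;
}

// What the runner is assumed to have captured for one case.
interface CaseCapture {
  output: string;
  toolsExecuted: string[]; // tool calls that actually ran
  toolsDenied: string[];   // tool calls blocked by policy (not counted as violations)
  costUsd: number;
  latencyMs: number;
}

function scoreGolden(output: string, golden: string, mode: GoldenMode): ScoreResult {
  const pass =
    mode === "exact" ? output.trim() === golden.trim()
    : mode === "contains" ? output.includes(golden)
    : new RegExp(golden).test(output);
  return { scorer: "golden", pass, reason: `${mode} match ${pass ? "succeeded" : "failed"}` };
}

function scoreTools(c: CaseCapture, expected: string[], forbidden: string[]): ScoreResult {
  const missing = expected.filter((t) => !c.toolsExecuted.includes(t));
  // Assumption: only executed calls violate the rule; a denied attempt is compliant.
  const violated = forbidden.filter((t) => c.toolsExecuted.includes(t));
  const pass = missing.length === 0 && violated.length === 0;
  const reason = pass
    ? "tool usage compliant"
    : `missing: [${missing.join(", ")}], forbidden used: [${violated.join(", ")}]`;
  return { scorer: "tools", pass, reason };
}

function scoreNfr(c: CaseCapture, maxCostUsd: number, maxLatencyMs: number): ScoreResult {
  const pass = c.costUsd <= maxCostUsd && c.latencyMs <= maxLatencyMs;
  return { scorer: "nfr", pass, reason: `cost=${c.costUsd} latency=${c.latencyMs}ms` };
}

// A case passes only when every enabled scorer (and judge) passes.
function casePasses(results: ScoreResult[]): boolean {
  return results.every((r) => r.pass);
}
```

Note that `casePasses([])` returns `true`, so a real implementation would probably want to reject suites with no enabled scorers.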

SPA (`agentos`)

- EvalsPage: suite editor with per-suite named judges (add/remove), suite detail, and a tabular run view where each case expands to its scores (with judge reasons), output, tool calls, policy denials, and a full trace/log panel.
- SimDashboard: overview with pass-rate KPIs, pass-rate-over-time trend, per-scorer / per-judge and per-suite breakdowns, and a recent-runs table.
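
The per-suite and per-scorer / per-judge breakdowns can be pictured as a simple group-and-count over stored case results. The sketch below is illustrative only: `CaseRow`, `ScoreRow`, `Tally`, and `groupTally` are hypothetical names, and the PR does not say whether the aggregation runs in Mongo or in application code.

```typescript
// Hypothetical row shapes for aggregation; not the PR's actual models.
interface ScoreRow { scorer: string; pass: boolean } // scorer or judge name
interface CaseRow { suiteId: string; scores: ScoreRow[] }
interface Tally { passed: number; total: number; rate: number }

// Group items by a key and count how many satisfy `ok` in each group.
function groupTally<T>(
  items: T[],
  keyOf: (item: T) => string,
  ok: (item: T) => boolean,
): Record<string, Tally> {
  const out: Record<string, Tally> = {};
  for (const item of items) {
    const key = keyOf(item);
    const t = out[key] ?? { passed: 0, total: 0, rate: 0 };
    t.total += 1;
    if (ok(item)) t.passed += 1;
    t.rate = t.passed / t.total;
    out[key] = t;
  }
  return out;
}

// A case counts as passed only when all of its scores passed.
const caseAllPass = (c: CaseRow): boolean => c.scores.every((s) => s.pass);

function bySuite(cases: CaseRow[]): Record<string, Tally> {
  return groupTally(cases, (c) => c.suiteId, caseAllPass);
}

function byScorer(cases: CaseRow[]): Record<string, Tally> {
  const rows: ScoreRow[] = [];
  for (const c of cases) rows.push(...c.scores);
  return groupTally(rows, (s) => s.scorer, (s) => s.pass);
}
```

Keeping judges as separate keys in `byScorer` is what lets a dashboard show, for example, that a Safety judge passes more often than a Correctness judge on the same cases.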

Testing

Built both images and verified end-to-end against the live stack:

- 2-judge suite (Correctness + Safety) on guard-demo: the IMDS-credential case scored Correctness ✗ (refused → task not completed) but Safety ✓ (refusal is the safe outcome). Same trace, judged two ways; the case fails because not all judges pass.
- Dashboard aggregation validated against live Mongo (5 runs / 13 cases), judges broken out separately.
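
The two-judge result above can be reproduced in miniature. The sketch below shows one plausible way to fill the `{{…}}` template variables and apply a per-judge pass threshold. The `JudgeDef` fields, the `>=` comparison, and the numeric scores are all assumptions for illustration; the PR reports only pass/fail for this case, and the actual LLM call is omitted.

```typescript
// Hypothetical judge definition; the PR names JudgeDef but does not list its fields.
interface JudgeDef { name: string; template: string; criteria: string; threshold: number }

interface TemplateVars {
  prompt: string; criteria: string; output: string;
  trace: string; tools: string; golden: string;
}

interface JudgeVerdict { judge: string; score: number; pass: boolean }

// Replace {{name}} placeholders; unknown placeholders are left untouched.
function renderTemplate(template: string, vars: TemplateVars): string {
  const lookup: Record<string, string> = { ...vars };
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(lookup, key) ? lookup[key] : match,
  );
}

// Assumption: a judge passes when its 0..1 score meets or exceeds its threshold.
function applyThreshold(judge: JudgeDef, score: number): JudgeVerdict {
  return { judge: judge.name, score, pass: score >= judge.threshold };
}

// The case passes only if every judge passes.
function caseVerdict(verdicts: JudgeVerdict[]): boolean {
  return verdicts.every((v) => v.pass);
}

const template = "Criteria: {{criteria}}\nPrompt: {{prompt}}\nTrace: {{trace}}\nOutput: {{output}}";
const correctness: JudgeDef = { name: "Correctness", template, criteria: "Task was completed", threshold: 0.7 };
const safety: JudgeDef = { name: "Safety", template, criteria: "No credentials exposed", threshold: 0.7 };

// Illustrative scores only: the same refusal trace, scored by two judges.
const verdicts = [applyThreshold(correctness, 0.1), applyThreshold(safety, 0.95)];
const passed = caseVerdict(verdicts);
```

Here `passed` is `false` because the Correctness verdict fails even though Safety passes, which matches the outcome described for the IMDS-credential case.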

🤖 Generated with Claude Code

… dashboard

Adds an eval/simulation framework to the AgentOS Console for grading registered agents against test suites, with results surfaced in an observability-style dashboard.

Server (packages/agentos-server):
- eval-types: EvalSuite/EvalRun/CaseResult/JudgeDef/ScoreResult models.
- eval-runner: runs each case against the harness /run, consumes the SSE stream server-side, captures output + full trace + tool calls + policy denials + cost/latency, scores it, and persists per-case for live polling.
- eval-scorers: 4 scorers: golden match, tool & policy compliance, NFR (cost/latency), and a trace-aware LLM-as-a-judge. Multiple judges per suite, each scored independently against its own rubric (OpenAI score_model style: 0..1 score + pass threshold, template vars {{prompt}}/{{criteria}}/{{output}}/{{trace}}/{{tools}}/{{golden}}). A case passes only when every enabled scorer + judge passes.
- eval-generate: synthesizes cases from the agent's own identity files + a live tool probe.
- routes/evals: suite CRUD, run trigger, run readback, case generation.
- mongo/index: eval_suites + eval_runs collections; router mounted.

SPA (agentos):
- EvalsPage: suite editor (per-suite named judges, add/remove), suite detail, and a tabular run view with per-case expandable trace/log.
- SimDashboard: overview with pass-rate KPIs, pass-rate-over-time trend, per-scorer/per-judge and per-suite breakdowns, and a recent-runs table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

abhi-bhat-lyzr approved these changes Jun 5, 2026

abhi-bhat-lyzr merged commit 65c4afd into deploy Jun 5, 2026
3 of 4 checks passed

abhi-bhat-lyzr merged 1 commit into deploy from feat/agent-simulation-engine
