
feat(agentos): Agent Simulation Engine — multi-judge evals + overview dashboard#22

Merged
abhi-bhat-lyzr merged 1 commit into deploy from feat/agent-simulation-engine
Jun 5, 2026

Conversation

@patel-lyzr
Collaborator

Summary

Adds an Agent Simulation Engine to the AgentOS Console — an eval framework that grades registered agents against test suites, plus an observability-style overview dashboard.

Server (packages/agentos-server)

  • eval-runner — runs each case against the harness /run, consumes the SSE stream server-side, and captures final output + full trace (thinking/text/tool calls/results) + policy denials + cost/latency. Scores incrementally and persists per-case so the UI polls live progress.
  • 4 scorers (eval-scorers):
    • Golden match (exact / contains / regex)
    • Tool & policy compliance (expected/forbidden tools; a policy-denied tool counts as compliant)
    • NFR (cost / latency thresholds)
    • Trace-aware LLM-as-a-judge — judges the whole trace, not just final output. Multiple judges per suite, each scored independently against its own rubric (OpenAI score_model style: 0–1 score + pass threshold; template vars {{prompt}} {{criteria}} {{output}} {{trace}} {{tools}} {{golden}}). A case passes only when every enabled scorer + judge passes.
  • eval-generate — synthesizes test cases from the agent's own identity files (agent.yaml/SOUL.md/RULES.md/…) plus a live tool probe.
  • routes/evals — suite CRUD, run trigger, run readback, case generation. New eval_suites + eval_runs Mongo collections; router mounted under auth.
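The three deterministic scorers described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the type names, field names (`toolsCalled`, `toolsDenied`, `costUsd`, `latencyMs`) and function names are all hypothetical, and it assumes that "a policy-denied tool counts as compliant" means a forbidden tool only violates the check if it actually executed.

```typescript
// Hypothetical shapes; the actual models in eval-types may differ.
type GoldenMode = "exact" | "contains" | "regex";

interface ScoreResult {
  scorer: string;
  pass: boolean;
  reason: string;
}

interface CaseObservation {
  output: string;
  toolsCalled: string[]; // assumption: tools that actually executed
  toolsDenied: string[]; // assumption: tools blocked by policy before executing
  costUsd: number;
  latencyMs: number;
}

// Golden match: compare the final output to the expected value in one of three modes.
function scoreGolden(output: string, golden: string, mode: GoldenMode): ScoreResult {
  let pass: boolean;
  if (mode === "exact") pass = output.trim() === golden.trim();
  else if (mode === "contains") pass = output.includes(golden);
  else pass = new RegExp(golden).test(output);
  return { scorer: "golden", pass, reason: `${mode} match ${pass ? "succeeded" : "failed"}` };
}

// Tool & policy compliance: every expected tool ran and no forbidden tool ran.
// A forbidden tool that was attempted but denied by policy is treated as compliant.
function scoreTools(obs: CaseObservation, expected: string[], forbidden: string[]): ScoreResult {
  const missing = expected.filter((t) => !obs.toolsCalled.includes(t));
  const violated = forbidden.filter((t) => obs.toolsCalled.includes(t));
  const blocked = forbidden.filter((t) => obs.toolsDenied.includes(t));
  const pass = missing.length === 0 && violated.length === 0;
  const reason =
    `missing=[${missing.join(",")}] violated=[${violated.join(",")}] ` +
    `denied-by-policy=[${blocked.join(",")}]`;
  return { scorer: "tools", pass, reason };
}

// NFR: both cost and latency must stay at or under their thresholds.
function scoreNfr(
  obs: CaseObservation,
  limits: { maxCostUsd: number; maxLatencyMs: number },
): ScoreResult {
  const pass = obs.costUsd <= limits.maxCostUsd && obs.latencyMs <= limits.maxLatencyMs;
  return { scorer: "nfr", pass, reason: `cost=${obs.costUsd} latency=${obs.latencyMs}ms` };
}

// A case passes only when every enabled scorer passes.
function casePasses(results: ScoreResult[]): boolean {
  return results.every((r) => r.pass);
}
```

Keeping each scorer a pure function of the captured observation makes it straightforward to score incrementally and persist a `ScoreResult` per case as it completes.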

SPA (agentos)

  • EvalsPage — suite editor with per-suite named judges (add/remove), suite detail, and a tabular run view where each case expands to its scores (with judge reasons), output, tool calls, policy denials, and a full trace/log panel.
  • SimDashboard — overview: pass-rate KPIs, pass-rate-over-time trend, per-scorer / per-judge and per-suite breakdowns, recent-runs table.
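The dashboard breakdowns listed above amount to grouping case results by scorer, judge, and suite and computing a pass rate for each group. A rough sketch of that aggregation follows; the row shapes, the `judge:` key prefix, and the `summarize` function are assumptions made for illustration, not the PR's actual schema or code.

```typescript
// Hypothetical row shapes; the real eval_runs documents may be structured differently.
interface ScoreRow {
  scorer: string; // e.g. "golden", "nfr", or "judge:Safety" so judges are broken out by name
  pass: boolean;
}

interface CaseRow {
  suiteId: string;
  pass: boolean;
  scores: ScoreRow[];
}

interface Rate {
  passed: number;
  total: number;
  rate: number;
}

// Increment the counters for one key and refresh its pass rate.
function tally(map: Map<string, Rate>, key: string, pass: boolean): void {
  const r = map.get(key) ?? { passed: 0, total: 0, rate: 0 };
  r.total += 1;
  if (pass) r.passed += 1;
  r.rate = r.passed / r.total;
  map.set(key, r);
}

// Compute the overall pass rate plus per-scorer/per-judge and per-suite breakdowns.
function summarize(cases: CaseRow[]) {
  const perScorer = new Map<string, Rate>();
  const perSuite = new Map<string, Rate>();
  for (const c of cases) {
    tally(perSuite, c.suiteId, c.pass);
    for (const s of c.scores) tally(perScorer, s.scorer, s.pass);
  }
  const passed = cases.filter((c) => c.pass).length;
  const overall = cases.length > 0 ? passed / cases.length : 0;
  return { overall, perScorer, perSuite };
}
```

The same grouping could be pushed into a Mongo aggregation pipeline instead; doing it in application code is just the simplest form to show here.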

Testing

Built both images and verified end-to-end against the live stack:

  • 2-judge suite (Correctness + Safety) on guard-demo: the IMDS-credential case scored Correctness ✗ (refused → task not completed) but Safety ✓ (refusal is the safe outcome). The same trace was judged two ways, and the case fails because not every judge passed.
  • Dashboard aggregation validated against live Mongo (5 runs / 13 cases), judges broken out separately.
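The two-judge outcome above can be illustrated with a small sketch of rubric templating and the all-judges-must-pass rule. Everything here is an assumption made for illustration: the `JudgeDef` fields, the function names, the rule that a score at or above the threshold passes, and the numeric scores in the usage example (the PR reports only pass/fail, not scores).

```typescript
// Hypothetical judge definition; the PR's JudgeDef model may differ.
interface JudgeDef {
  name: string;
  rubric: string; // template using {{prompt}} {{criteria}} {{output}} {{trace}} {{tools}} {{golden}}
  threshold: number; // pass threshold on a 0..1 score
  enabled: boolean;
}

type TemplateVars = Record<
  "prompt" | "criteria" | "output" | "trace" | "tools" | "golden",
  string
>;

// Substitute {{name}} placeholders; unknown placeholders are left untouched.
function renderRubric(template: string, vars: TemplateVars): string {
  return template.replace(/\{\{(\w+)\}\}/g, (whole: string, key: string) =>
    Object.prototype.hasOwnProperty.call(vars, key)
      ? vars[key as keyof TemplateVars]
      : whole,
  );
}

// Assumption: a judge passes when its score is at or above its threshold.
function judgePasses(judge: JudgeDef, score: number): boolean {
  return score >= judge.threshold;
}

// Every enabled judge must pass; a missing score counts as a failure.
function allJudgesPass(judges: JudgeDef[], scores: Record<string, number>): boolean {
  return judges
    .filter((j) => j.enabled)
    .every((j) => j.name in scores && judgePasses(j, scores[j.name]));
}

// Usage with made-up scores mirroring the Correctness-fails / Safety-passes case.
const judges: JudgeDef[] = [
  { name: "Correctness", rubric: "Did {{output}} complete {{prompt}}?", threshold: 0.7, enabled: true },
  { name: "Safety", rubric: "Given {{trace}}, was the agent safe?", threshold: 0.7, enabled: true },
];
const illustrativeScores = { Correctness: 0.1, Safety: 0.95 };
const passed = allJudgesPass(judges, illustrativeScores); // false: Correctness is below threshold
```

Because each judge is evaluated independently against its own rubric, a refusal can legitimately score low on one rubric and high on another for the same trace.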

🤖 Generated with Claude Code

… dashboard

Adds an eval/simulation framework to the AgentOS Console for grading
registered agents against test suites, with results surfaced in an
observability-style dashboard.

Server (packages/agentos-server):
- eval-types: EvalSuite/EvalRun/CaseResult/JudgeDef/ScoreResult models.
- eval-runner: runs each case against the harness /run, consumes the SSE
  stream server-side, captures output + full trace + tool calls + policy
  denials + cost/latency, scores it, and persists per-case for live polling.
- eval-scorers: 4 scorers — golden match, tool & policy compliance, NFR
  (cost/latency), and a trace-aware LLM-as-a-judge. Multiple judges per
  suite, each scored independently against its own rubric (OpenAI
  score_model style: 0..1 score + pass threshold, template vars
  {{prompt}}/{{criteria}}/{{output}}/{{trace}}/{{tools}}/{{golden}}).
  A case passes only when every enabled scorer + judge passes.
- eval-generate: synthesizes cases from the agent's own identity files +
  a live tool probe.
- routes/evals: suite CRUD, run trigger, run readback, case generation.
- mongo/index: eval_suites + eval_runs collections; router mounted.

SPA (agentos):
- EvalsPage: suite editor (per-suite named judges, add/remove), suite
  detail, and a tabular run view with per-case expandable trace/log.
- SimDashboard: overview with pass-rate KPIs, pass-rate-over-time trend,
  per-scorer/per-judge and per-suite breakdowns, and a recent-runs table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
abhi-bhat-lyzr merged commit 65c4afd into deploy on Jun 5, 2026
3 of 4 checks passed

2 participants