Conway Agent OS v0.1 MVP + Comparative Eval Scoring by devin-ai-integration[bot] · Pull Request #328 · Conway-Research/automaton

devin-ai-integration · 2026-06-05T14:02:50Z

Summary

Full Conway Agent OS implementation (Observe → Remember → Suggest → Evaluate → Govern → Deploy → Monetize → Receipt → Learn → Reward → Upgrade) with comparative eval scoring — baseline vs candidate agents are simulated against auto-generated rubrics and scored so the user can deploy the best candidate.

Comparative Eval (new in latest commit)

Eval runs now go beyond pass/fail attestations:

rubric_generator.py — auto-generates Rubric from production traces (AutonomyReceipt, DeploymentReceipt, PaymentReceipt), deriving dimensions across capability/deployment/monetization/safety categories
run_comparative_eval() — simulates baseline + candidate against worldlets, scores per-dimension via _simulate_agent_scores(), computes weighted average, picks winner + recommendation
POST /v1/eval/compare — accepts {baseline_id, candidate_id, project_id}, optionally auto-generates rubric if no rubric_id provided
POST /v1/eval/runs/{id}/deploy-candidate — promotes winning candidate to canary
EvalRun model gains eval_mode, baseline_id, baseline_scores, candidate_scores, winner, score_delta, recommendation
Frontend Eval.tsx — side-by-side score table with per-dimension deltas, winner badge, "Deploy Candidate" button
Demo flow — creates baseline trace, auto-generates 6-dimension rubric, runs comparative eval (baseline ~0.95 vs candidate ~0.98 → deploy_candidate)
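The scoring step described above (per-dimension scores, weighted average, winner, recommendation) can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `Dimension` class, the `compare` helper, the tie-breaking rule (ties keep the baseline), and the `keep_baseline` label are assumptions. Only `deploy_candidate`, `winner`, and `score_delta` are names taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """Hypothetical rubric dimension: a name plus a relative weight."""
    name: str
    weight: float


def weighted_average(scores: dict[str, float], rubric: list[Dimension]) -> float:
    """Weight each dimension's score and normalise by the total weight."""
    total_weight = sum(d.weight for d in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return sum(scores[d.name] * d.weight for d in rubric) / total_weight


def compare(baseline: dict[str, float], candidate: dict[str, float],
            rubric: list[Dimension]) -> dict:
    """Score both agents on the same rubric and pick a winner."""
    baseline_avg = weighted_average(baseline, rubric)
    candidate_avg = weighted_average(candidate, rubric)
    score_delta = candidate_avg - baseline_avg
    # Assumption: the candidate must strictly beat the baseline to win.
    winner = "candidate" if score_delta > 0 else "baseline"
    return {
        "baseline_avg": baseline_avg,
        "candidate_avg": candidate_avg,
        "score_delta": score_delta,
        "winner": winner,
        # "keep_baseline" is a made-up label for the losing case.
        "recommendation": "deploy_candidate" if winner == "candidate" else "keep_baseline",
        "per_dimension_delta": {d.name: candidate[d.name] - baseline[d.name] for d in rubric},
    }


# Illustrative numbers only; they are not the demo's actual scores.
rubric = [Dimension("capability", 2.0), Dimension("safety", 1.0)]
baseline_scores = {"capability": 0.90, "safety": 1.00}
candidate_scores = {"capability": 0.96, "safety": 1.00}
result = compare(baseline_scores, candidate_scores, rubric)
```

Normalising by the total weight keeps the average on the same scale as the per-dimension scores, so rubrics with different numbers of dimensions remain comparable.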

41 tests passing (36 original + 5 new comparative eval tests).

Link to Devin session: https://app.devin.ai/sessions/8fda63f035ef4651a663b6d48a53b3de
Requested by: @Sigil-Wen

Implements the Conway Everyone-Improving Agent OS with:

- Full backend (FastAPI, SQLAlchemy, Pydantic v2, SQLite)
- 14 API route modules covering the complete system
- 15+ services including budget governor, eval runner, deploy broker
- Mock Sanctuary adapter, mock MPP gateway, mock Cloudflare deploy
- React/TypeScript Conway Cockpit frontend
- 36 passing tests covering privacy, budget, authority, eval, deploy, rewards
- Full demo flow: Observe -> Remember -> Suggest -> Evaluate -> Govern -> Deploy -> Monetize -> Receipt -> Learn -> Reward -> Upgrade
- Makefile: install, dev, test, seed, demo commands

Co-Authored-By: Sigil Wen <sigil.w3n@gmail.com>

devin-ai-integration · 2026-06-05T14:02:53Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


devin-ai-integration

Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.

devin-ai-integration · 2026-06-05T14:07:11Z

+    refund = entry.amount_reserved - actual_cost
+    if refund > 0:
+        project.treasury_balance += refund


🔴 settle_budget silently ignores cost overruns, leaving treasury balance overstated

When actual_cost > amount_reserved, the refund variable is negative, so the if refund > 0 guard at line 100 prevents any balance adjustment. Since reserve_budget (budget_governor.py:84) only decremented the treasury by amount_reserved, the excess cost (actual_cost - amount_reserved) is never deducted. This leaves project.treasury_balance artificially inflated, which can cause subsequent reserve_budget calls to incorrectly approve reservations that should be denied due to insufficient funds.

Example: reserve $5, settle at $8

reserve_budget: treasury goes from $100 → $95 (reserved $5)

settle_budget: refund = 5 - 8 = -3, if -3 > 0 is False → no adjustment

Treasury stays at $95, but should be $92 (short by $3)

Suggested change

-    refund = entry.amount_reserved - actual_cost
-    if refund > 0:
-        project.treasury_balance += refund
+    refund = entry.amount_reserved - actual_cost
+    if refund > 0:
+        project.treasury_balance += refund
+    elif refund < 0:
+        project.treasury_balance += refund  # deduct the overrun


Fixed — now deducts the overrun from treasury when actual_cost > amount_reserved.
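The reserve $5, settle at $8 example can be replayed with a small self-contained sketch. The `Project` and `BudgetEntry` classes and the function signatures here are assumptions made for illustration; the real `budget_governor.py` may differ. Because both branches of the suggested change apply `+= refund`, the sketch applies the signed difference unconditionally, which should be equivalent.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Hypothetical stand-in for the project row; only the balance matters here."""
    treasury_balance: float


@dataclass
class BudgetEntry:
    """Hypothetical stand-in for a reservation record."""
    amount_reserved: float


def reserve_budget(project: Project, amount: float) -> BudgetEntry:
    """Deduct the reserved amount up front, refusing if funds are insufficient."""
    if amount > project.treasury_balance:
        raise ValueError("insufficient funds")
    project.treasury_balance -= amount
    return BudgetEntry(amount_reserved=amount)


def settle_budget(project: Project, entry: BudgetEntry, actual_cost: float) -> None:
    """Reconcile the reservation with the actual cost.

    A positive difference refunds the unused reservation; a negative
    difference deducts the overrun.
    """
    refund = entry.amount_reserved - actual_cost
    project.treasury_balance += refund


# Overrun case from the review comment: reserve $5, settle at $8.
overrun_project = Project(treasury_balance=100.0)
overrun_entry = reserve_budget(overrun_project, 5.0)   # balance is now 95.0
settle_budget(overrun_project, overrun_entry, 8.0)     # balance is now 92.0

# Under-spend case: reserve $5, settle at $3.
refund_project = Project(treasury_balance=100.0)
refund_entry = reserve_budget(refund_project, 5.0)
settle_budget(refund_project, refund_entry, 3.0)       # balance is now 97.0
```

Note that deducting an overrun this way can push the balance below zero if the overrun exceeds the remaining funds; whether that should be allowed is a policy choice this sketch does not address.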

devin-ai-integration · 2026-06-05T14:07:13Z

+) -> float:
+    utility = (
+        expected_value
+        + WEIGHTS["beta_cash"] * math.log(cash_after + EPS)


🔴 rank_candidate crashes with ValueError on low-balance projects due to math.log of negative number

math.log(cash_after + EPS) raises ValueError: math domain error when cash_after + EPS <= 0. This is reachable via the /v1/suggestions/generate/{project_id} API endpoint: generate_suggestions calls rank_candidate with cash_after=project.treasury_balance - 0.03 (line 69) and cash_after=project.treasury_balance - 0.50 (line 113). If a project's treasury balance is near zero (e.g., after budget deductions in RED/YELLOW mode), cash_after + EPS becomes negative, crashing the suggestion generation with an unhandled 500 error.

Suggested change

-        + WEIGHTS["beta_cash"] * math.log(cash_after + EPS)
+        + WEIGHTS["beta_cash"] * math.log(max(cash_after, 0) + EPS)


Fixed — clamped cash_after to 0 before log() to prevent ValueError on low-balance projects.
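A short sketch of the failure and the clamp, isolated from the rest of `rank_candidate`. The value of `EPS` and the `cash_term` helper are assumptions for illustration; the PR does not show the real constant or the full utility function.

```python
import math

EPS = 1e-6  # assumed value; the actual constant is not shown in the PR


def cash_term_unclamped(cash_after: float, beta_cash: float = 1.0) -> float:
    """Original form: raises ValueError when cash_after + EPS <= 0."""
    return beta_cash * math.log(cash_after + EPS)


def cash_term(cash_after: float, beta_cash: float = 1.0) -> float:
    """Clamped form: the log argument is never below EPS."""
    return beta_cash * math.log(max(cash_after, 0.0) + EPS)


# A treasury balance of 0.0 minus the 0.03 cost cited in the review.
low_cash = 0.0 - 0.03

try:
    cash_term_unclamped(low_cash)
    unclamped_failed = False
except ValueError:
    unclamped_failed = True  # math.log rejects non-positive input

clamped_value = cash_term(low_cash)  # equals math.log(EPS)
```

One consequence of the clamp is that every negative `cash_after` maps to the same value, `math.log(EPS)`, so the utility no longer distinguishes how far below zero a candidate would push the balance.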

- suggestion_engine: clamp cash_after to 0 before log to avoid ValueError
- budget_governor: deduct cost overruns when actual_cost > reserved amount

Co-Authored-By: Sigil Wen <sigil.w3n@gmail.com>

devin-ai-integration · 2026-06-05T14:48:50Z

E2E Test Results — All Passed

Ran backend + frontend locally, seeded data, executed 23-step demo loop, then verified invariants via API and UI.

5/5 Tests Passed

Test	Result
Backend demo loop completes (23 steps)	passed
Privacy: hidden holdouts redacted in public API	passed
Budget governor blocks RED-mode deploys (403)	passed
Frontend Cockpit renders all pages with data	passed
Rewards page shows $0.02 payment + 4 ledger entries	passed

Key Assertion Values

demo_loop_completed: True
project_mode: GREEN
reward_entries: 4
raw_receipt_publicly_visible: False
hidden_holdout_publicly_visible: False
r5_govern_blocked: True
r2_upgrade_authorized: True

RED-mode deploy test: POST /v1/deploy/{capsule}/canary → HTTP 403 "Project in RED mode, deploy blocked"
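The RED-mode check exercised by this test might look roughly like the guard below. It is a framework-free sketch: `DeployBlocked` and `ensure_deploy_allowed` are hypothetical names, and the behaviour for YELLOW mode is not stated in the PR, so the sketch blocks only RED. Only the 403 status and the detail message come from the test output.

```python
class DeployBlocked(Exception):
    """Hypothetical error carrying the HTTP status the API layer would return."""

    def __init__(self, detail: str, status_code: int = 403) -> None:
        super().__init__(detail)
        self.detail = detail
        self.status_code = status_code


def ensure_deploy_allowed(project_mode: str) -> None:
    """Raise before any deploy work starts if the project is in RED mode."""
    if project_mode == "RED":
        raise DeployBlocked("Project in RED mode, deploy blocked")


def try_canary_deploy(project_mode: str) -> tuple[int, str]:
    """Return (status, message) the way an endpoint handler might."""
    try:
        ensure_deploy_allowed(project_mode)
    except DeployBlocked as exc:
        return exc.status_code, exc.detail
    return 200, "canary deploy accepted"  # placeholder success response
```

Running the guard before the deploy starts means a blocked request has no side effects to roll back.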

Screenshots (images not included): Dashboard, Suggestions (after Generate), Rewards & Revenue.


…rated rubrics

- Add EvalRun model fields: eval_mode, baseline_id, baseline_scores, candidate_scores, winner, score_delta, recommendation
- New rubric_generator service: auto-generates rubric dimensions from production traces (autonomy receipts, deployment receipts, payments)
- New run_comparative_eval in eval_runner: simulates baseline and candidate agents against worldlets, scores per-dimension, picks winner
- POST /v1/eval/compare endpoint: triggers comparative eval with optional auto-generated rubric
- POST /v1/eval/runs/{id}/deploy-candidate: promotes winning candidate
- GET /v1/eval/runs: returns comparison data for comparative runs
- Frontend Eval page: side-by-side score table with per-dimension deltas, winner badge, and Deploy Candidate button
- Demo flow updated: creates baseline trace, auto-generates rubric, runs comparative eval (baseline 0.95 vs candidate 0.98)
- 5 new tests for comparative eval (41 total, all passing)

Co-Authored-By: Sigil Wen <sigil.w3n@gmail.com>

devin-ai-integration · 2026-06-06T05:06:38Z

Comparative Eval Scoring — E2E Test Results

Ran backend + frontend locally against seeded data with demo flow, tested the comparative eval UI and API end-to-end.

Comparative Eval UI Tests (4/4 passed)

Comparative eval cards with score tables — passed: "COMPARATIVE EVALS (2)" header, two cards with "Baseline vs Candidate" title, "AUTO-GENERATED RUBRIC" badge, "WINNER: CANDIDATE" badge, score table (Dimension/Baseline/Candidate/Delta), baseline in orange, candidate in green, Weighted Average row, "Recommendation: DEPLOY CANDIDATE"
Deploy Candidate button — passed: clicked button, green success banner: "Deployed: Candidate cap_747742d71c6e approved for deployment"
Single-mode eval runs — passed: "EVAL RUNS (1)" with "PASSED" badge, no comparative fields
Attestations section — passed: 2 attestations rendered correctly

API Edge Case Tests (2/2 passed)

POST /v1/eval/compare — passed: created new comparison with 12 auto-generated dimensions, rubric_auto_generated: "true", winner: "candidate"
Deploy-candidate rejects non-comparative — passed: HTTP 400, {"detail":"Not a comparative eval run"}

6/6 tests passed. No escalations.


devin-ai-integration Bot assigned Sigil-Wen Jun 5, 2026


fix: prevent math.log crash on low balance + handle settle overruns

0af5d5a


devin-ai-integration Bot changed the title ~~Conway Agent OS v0.1 MVP — Full loop implementation~~ Conway Agent OS v0.1 MVP + Comparative Eval Scoring Jun 6, 2026

devin-ai-integration[bot] wants to merge 3 commits into main from devin/1780668136-conway-os-mvp
