Conway Agent OS v0.1 MVP + Comparative Eval Scoring#328

Open
devin-ai-integration[bot] wants to merge 3 commits into
mainfrom
devin/1780668136-conway-os-mvp

@devin-ai-integration devin-ai-integration Bot commented Jun 5, 2026

Summary

Full Conway Agent OS implementation (Observe → Remember → Suggest → Evaluate → Govern → Deploy → Monetize → Receipt → Learn → Reward → Upgrade) with comparative eval scoring — baseline vs candidate agents are simulated against auto-generated rubrics and scored so the user can deploy the best candidate.

Comparative Eval (new in latest commit)

Eval runs now go beyond pass/fail attestations:

  1. rubric_generator.py — auto-generates Rubric from production traces (AutonomyReceipt, DeploymentReceipt, PaymentReceipt), deriving dimensions across capability/deployment/monetization/safety categories
  2. run_comparative_eval() — simulates baseline + candidate against worldlets, scores per-dimension via _simulate_agent_scores(), computes weighted average, picks winner + recommendation
  3. POST /v1/eval/compare — accepts {baseline_id, candidate_id, project_id}, optionally auto-generates rubric if no rubric_id provided
  4. POST /v1/eval/runs/{id}/deploy-candidate — promotes winning candidate to canary
  5. EvalRun model gains eval_mode, baseline_id, baseline_scores, candidate_scores, winner, score_delta, recommendation
  6. Frontend Eval.tsx — side-by-side score table with per-dimension deltas, winner badge, "Deploy Candidate" button
  7. Demo flow — creates baseline trace, auto-generates 6-dimension rubric, runs comparative eval (baseline ~0.95 vs candidate ~0.98 → deploy_candidate)
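The winner selection described in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in eval_runner: the `Dimension` class, the `compare` helper, the `min_delta` threshold, and the `keep_baseline` recommendation string are all hypothetical names; only `deploy_candidate`, `winner`, `score_delta`, and `recommendation` appear in the PR description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One rubric dimension (hypothetical shape, not the PR's Rubric model)."""
    name: str
    weight: float


def weighted_average(scores: dict[str, float], rubric: list[Dimension]) -> float:
    """Weighted mean of per-dimension scores, normalized by total weight."""
    total_weight = sum(d.weight for d in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(scores[d.name] * d.weight for d in rubric) / total_weight


def compare(
    baseline: dict[str, float],
    candidate: dict[str, float],
    rubric: list[Dimension],
    min_delta: float = 0.0,  # assumed threshold; the PR does not state one
) -> dict:
    """Score both agents and recommend the candidate only if it is strictly better."""
    baseline_avg = weighted_average(baseline, rubric)
    candidate_avg = weighted_average(candidate, rubric)
    score_delta = candidate_avg - baseline_avg
    winner = "candidate" if score_delta > min_delta else "baseline"
    return {
        "winner": winner,
        "score_delta": score_delta,
        "per_dimension_delta": {d.name: candidate[d.name] - baseline[d.name] for d in rubric},
        # "keep_baseline" is a placeholder label for the non-deploy outcome.
        "recommendation": "deploy_candidate" if winner == "candidate" else "keep_baseline",
    }


rubric = [Dimension("capability", 2.0), Dimension("safety", 1.0)]
result = compare(
    baseline={"capability": 0.9, "safety": 1.0},
    candidate={"capability": 1.0, "safety": 1.0},
    rubric=rubric,
)
```

Normalizing by the total weight keeps the average in the same range as the individual scores even when the rubric weights do not sum to 1.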

41 tests passing (36 original + 5 new comparative eval tests).

Link to Devin session: https://app.devin.ai/sessions/8fda63f035ef4651a663b6d48a53b3de
Requested by: @Sigil-Wen



Implements the Conway Everyone-Improving Agent OS with:
- Full backend (FastAPI, SQLAlchemy, Pydantic v2, SQLite)
- 14 API route modules covering the complete system
- 15+ services including budget governor, eval runner, deploy broker
- Mock Sanctuary adapter, mock MPP gateway, mock Cloudflare deploy
- React/TypeScript Conway Cockpit frontend
- 36 passing tests covering privacy, budget, authority, eval, deploy, rewards
- Full demo flow: Observe -> Remember -> Suggest -> Evaluate -> Govern ->
  Deploy -> Monetize -> Receipt -> Learn -> Reward -> Upgrade
- Makefile: install, dev, test, seed, demo commands

Co-Authored-By: Sigil Wen <sigil.w3n@gmail.com>
@devin-ai-integration


🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.


Comment on lines +99 to +101
    refund = entry.amount_reserved - actual_cost
    if refund > 0:
        project.treasury_balance += refund


🔴 settle_budget silently ignores cost overruns, leaving treasury balance overstated

When actual_cost > amount_reserved, the refund variable is negative, so the if refund > 0 guard at line 100 prevents any balance adjustment. Since reserve_budget (budget_governor.py:84) only decremented the treasury by amount_reserved, the excess cost (actual_cost - amount_reserved) is never deducted. This leaves project.treasury_balance artificially inflated, which can cause subsequent reserve_budget calls to incorrectly approve reservations that should be denied due to insufficient funds.

Example: reserve $5, settle at $8
  • reserve_budget: treasury goes from $100 → $95 (reserved $5)
  • settle_budget: refund = 5 - 8 = -3, if -3 > 0 is False → no adjustment
  • Treasury stays at $95, but should be $92 (short by $3)
Suggested change
-   refund = entry.amount_reserved - actual_cost
-   if refund > 0:
-       project.treasury_balance += refund
+   refund = entry.amount_reserved - actual_cost
+   if refund > 0:
+       project.treasury_balance += refund
+   elif refund < 0:
+       project.treasury_balance += refund  # deduct the overrun

Fixed — now deducts the overrun from treasury when actual_cost > amount_reserved.
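The reserve-then-settle arithmetic from the review example ($100 treasury, reserve $5, settle at $8) can be reproduced with a small sketch. The `Project` and `BudgetEntry` classes below are simplified stand-ins assumed for illustration, not the SQLAlchemy models in the PR, and the single signed adjustment is one possible way to write the fix rather than the committed code.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Stand-in for the project model; only the treasury balance is modeled."""
    treasury_balance: float


@dataclass
class BudgetEntry:
    """Stand-in for a reservation ledger entry."""
    amount_reserved: float


def reserve_budget(project: Project, amount: float) -> BudgetEntry:
    """Deduct the reservation up front, refusing if funds are insufficient."""
    if amount > project.treasury_balance:
        raise ValueError("insufficient funds for reservation")
    project.treasury_balance -= amount
    return BudgetEntry(amount_reserved=amount)


def settle_budget(project: Project, entry: BudgetEntry, actual_cost: float) -> None:
    """Reconcile the reservation against the actual cost.

    A positive difference refunds the unused reservation; a negative
    difference deducts the overrun. Both cases are the same signed addition.
    """
    refund = entry.amount_reserved - actual_cost
    project.treasury_balance += refund


project = Project(treasury_balance=100.0)
entry = reserve_budget(project, 5.0)       # treasury: 100 -> 95
settle_budget(project, entry, actual_cost=8.0)  # treasury: 95 -> 92
```

Note that this sketch lets an overrun push the balance below zero; whether that should be allowed, capped, or flagged is a policy decision the PR discussion does not settle.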

) -> float:
    utility = (
        expected_value
        + WEIGHTS["beta_cash"] * math.log(cash_after + EPS)


🔴 rank_candidate crashes with ValueError on low-balance projects due to math.log of negative number

math.log(cash_after + EPS) raises ValueError: math domain error when cash_after + EPS <= 0. This is reachable via the /v1/suggestions/generate/{project_id} API endpoint: generate_suggestions calls rank_candidate with cash_after=project.treasury_balance - 0.03 (line 69) and cash_after=project.treasury_balance - 0.50 (line 113). If a project's treasury balance is near zero (e.g., after budget deductions in RED/YELLOW mode), cash_after + EPS becomes negative, crashing the suggestion generation with an unhandled 500 error.

Suggested change
-       + WEIGHTS["beta_cash"] * math.log(cash_after + EPS)
+       + WEIGHTS["beta_cash"] * math.log(max(cash_after, 0) + EPS)

Fixed — clamped cash_after to 0 before log() to prevent ValueError on low-balance projects.
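The failure and the clamp can be demonstrated in isolation. In this sketch the value of `EPS`, the `beta_cash` default, and the helper name `cash_utility_term` are assumptions for illustration; the PR only shows the single expression inside `rank_candidate`.

```python
import math

EPS = 1e-9  # assumed value; the PR does not show how EPS is defined


def cash_utility_term(cash_after: float, beta_cash: float = 1.0) -> float:
    """Log-cash term with cash_after clamped at 0 so the log argument stays positive."""
    return beta_cash * math.log(max(cash_after, 0.0) + EPS)


# Unclamped: a treasury of 0.00 minus a 0.03 cost gives a negative argument.
try:
    math.log((0.00 - 0.03) + EPS)
    unclamped_failed = False
except ValueError:
    unclamped_failed = True

# Clamped: the same input now evaluates to log(EPS), a large negative penalty.
clamped_value = cash_utility_term(0.00 - 0.03)
```

With the clamp, any non-positive `cash_after` maps to the same value, `log(EPS)` (about -20.7 for an `EPS` of 1e-9), so candidates that would overdraw the treasury are penalized equally regardless of how far they overdraw.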

- suggestion_engine: clamp cash_after to 0 before log to avoid ValueError
- budget_governor: deduct cost overruns when actual_cost > reserved amount

Co-Authored-By: Sigil Wen <sigil.w3n@gmail.com>
@devin-ai-integration


E2E Test Results — All Passed

Ran backend + frontend locally, seeded data, executed 23-step demo loop, then verified invariants via API and UI.

5/5 Tests Passed

| Test | Result |
| --- | --- |
| Backend demo loop completes (23 steps) | passed |
| Privacy: hidden holdouts redacted in public API | passed |
| Budget governor blocks RED-mode deploys (403) | passed |
| Frontend Cockpit renders all pages with data | passed |
| Rewards page shows $0.02 payment + 4 ledger entries | passed |

Key Assertion Values
demo_loop_completed: True
project_mode: GREEN
reward_entries: 4
raw_receipt_publicly_visible: False
hidden_holdout_publicly_visible: False
r5_govern_blocked: True
r2_upgrade_authorized: True

RED-mode deploy test: POST /v1/deploy/{capsule}/canary → HTTP 403 "Project in RED mode, deploy blocked"
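A governor check of this kind might look like the sketch below. It is framework-free and purely illustrative: the `Mode` enum, the `DeployBlocked` exception, and `check_deploy_allowed` are hypothetical names, and only the mode names, the 403 status, and the error message come from the test results above. How YELLOW mode is treated is not stated in the PR; here it is assumed to allow deploys.

```python
from enum import Enum


class Mode(str, Enum):
    """Project budget modes named in the PR discussion."""
    GREEN = "GREEN"
    YELLOW = "YELLOW"
    RED = "RED"


class DeployBlocked(Exception):
    """Raised when governance forbids a deploy; maps to HTTP 403 at the API layer."""
    status_code = 403


def check_deploy_allowed(mode: Mode) -> None:
    """Refuse canary deploys while the project is in RED mode."""
    if mode is Mode.RED:
        raise DeployBlocked("Project in RED mode, deploy blocked")


# A route handler would call this before promoting a capsule to canary.
check_deploy_allowed(Mode.GREEN)  # passes silently
try:
    check_deploy_allowed(Mode.RED)
    blocked = None
except DeployBlocked as exc:
    blocked = (exc.status_code, str(exc))
```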

Screenshots

[Screenshots: Dashboard; Suggestions (after Generate); Rewards & Revenue]

Devin session

…rated rubrics

- Add EvalRun model fields: eval_mode, baseline_id, baseline_scores,
  candidate_scores, winner, score_delta, recommendation
- New rubric_generator service: auto-generates rubric dimensions from
  production traces (autonomy receipts, deployment receipts, payments)
- New run_comparative_eval in eval_runner: simulates baseline and
  candidate agents against worldlets, scores per-dimension, picks winner
- POST /v1/eval/compare endpoint: triggers comparative eval with
  optional auto-generated rubric
- POST /v1/eval/runs/{id}/deploy-candidate: promotes winning candidate
- GET /v1/eval/runs: returns comparison data for comparative runs
- Frontend Eval page: side-by-side score table with per-dimension
  deltas, winner badge, and Deploy Candidate button
- Demo flow updated: creates baseline trace, auto-generates rubric,
  runs comparative eval (baseline 0.95 vs candidate 0.98)
- 5 new tests for comparative eval (41 total, all passing)

Co-Authored-By: Sigil Wen <sigil.w3n@gmail.com>
@devin-ai-integration devin-ai-integration Bot changed the title from "Conway Agent OS v0.1 MVP — Full loop implementation" to "Conway Agent OS v0.1 MVP + Comparative Eval Scoring" on Jun 6, 2026
@devin-ai-integration


Comparative Eval Scoring — E2E Test Results

Ran backend + frontend locally against seeded data with demo flow, tested the comparative eval UI and API end-to-end.

Comparative Eval UI Tests (4/4 passed)
  • Comparative eval cards with score tables — passed: "COMPARATIVE EVALS (2)" header, two cards with "Baseline vs Candidate" title, "AUTO-GENERATED RUBRIC" badge, "WINNER: CANDIDATE" badge, score table (Dimension/Baseline/Candidate/Delta), baseline in orange, candidate in green, Weighted Average row, "Recommendation: DEPLOY CANDIDATE"
  • Deploy Candidate button — passed: clicked button, green success banner: "Deployed: Candidate cap_747742d71c6e approved for deployment"
  • Single-mode eval runs — passed: "EVAL RUNS (1)" with "PASSED" badge, no comparative fields
  • Attestations section — passed: 2 attestations rendered correctly

Eval page with comparative cards

API Edge Case Tests (2/2 passed)
  • POST /v1/eval/compare — passed: created new comparison with 12 auto-generated dimensions, rubric_auto_generated: "true", winner: "candidate"
  • Deploy-candidate rejects non-comparative — passed: HTTP 400, {"detail":"Not a comparative eval run"}

6/6 tests passed. No escalations.

Devin session
