Conway Agent OS v0.1 MVP + Comparative Eval Scoring #328
devin-ai-integration[bot] wants to merge 3 commits into
Implements the Conway Everyone-Improving Agent OS with:
- Full backend (FastAPI, SQLAlchemy, Pydantic v2, SQLite)
- 14 API route modules covering the complete system
- 15+ services including budget governor, eval runner, deploy broker
- Mock Sanctuary adapter, mock MPP gateway, mock Cloudflare deploy
- React/TypeScript Conway Cockpit frontend
- 36 passing tests covering privacy, budget, authority, eval, deploy, rewards
- Full demo flow: Observe -> Remember -> Suggest -> Evaluate -> Govern -> Deploy -> Monetize -> Receipt -> Learn -> Reward -> Upgrade
- Makefile: install, dev, test, seed, demo commands
Co-Authored-By: Sigil Wen <sigil.w3n@gmail.com>
    refund = entry.amount_reserved - actual_cost
    if refund > 0:
        project.treasury_balance += refund
🔴 settle_budget silently ignores cost overruns, leaving treasury balance overstated
When actual_cost > amount_reserved, the refund variable is negative, so the if refund > 0 guard at line 100 prevents any balance adjustment. Since reserve_budget (budget_governor.py:84) only decremented the treasury by amount_reserved, the excess cost (actual_cost - amount_reserved) is never deducted. This leaves project.treasury_balance artificially inflated, which can cause subsequent reserve_budget calls to incorrectly approve reservations that should be denied due to insufficient funds.
Example: reserve $5, settle at $8
- reserve_budget: treasury goes from $100 → $95 (reserved $5)
- settle_budget: refund = 5 - 8 = -3, and if -3 > 0 is False → no adjustment
- Treasury stays at $95, but should be $92 (short by $3)
Suggested change:
    - refund = entry.amount_reserved - actual_cost
    - if refund > 0:
    -     project.treasury_balance += refund
    + refund = entry.amount_reserved - actual_cost
    + if refund > 0:
    +     project.treasury_balance += refund
    + elif refund < 0:
    +     project.treasury_balance += refund  # deduct the overrun
Fixed — now deducts the overrun from treasury when actual_cost > amount_reserved.
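The reserve/settle arithmetic from the example in this thread can be checked with a minimal sketch. The `Project` and `LedgerEntry` dataclasses and the function signatures below are assumptions made for illustration; they are not the actual models in budget_governor.py. Because the refund and overrun branches apply the same adjustment, the sketch adds the signed difference unconditionally.

```python
from dataclasses import dataclass


@dataclass
class Project:
    # Hypothetical stand-in for the real project model.
    treasury_balance: float


@dataclass
class LedgerEntry:
    # Hypothetical stand-in for the real budget ledger entry.
    amount_reserved: float


def reserve_budget(project: Project, amount: float) -> LedgerEntry:
    """Hold `amount` out of the treasury; refuse if funds are insufficient."""
    if amount > project.treasury_balance:
        raise ValueError("insufficient treasury balance")
    project.treasury_balance -= amount
    return LedgerEntry(amount_reserved=amount)


def settle_budget(project: Project, entry: LedgerEntry, actual_cost: float) -> float:
    """Reconcile a reservation against the actual cost.

    A positive difference is refunded to the treasury; a negative
    difference (a cost overrun) is deducted from it.
    """
    refund = entry.amount_reserved - actual_cost
    project.treasury_balance += refund
    return refund


project = Project(treasury_balance=100.0)
entry = reserve_budget(project, 5.0)           # treasury: 100 -> 95
refund = settle_budget(project, entry, 8.0)    # overrun of 3: 95 -> 92
print(project.treasury_balance)                # 92.0
```

One open design question this sketch does not settle: a large overrun can push the balance below zero, so a real implementation may want to flag or cap that case.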
    ) -> float:
        utility = (
            expected_value
            + WEIGHTS["beta_cash"] * math.log(cash_after + EPS)
🔴 rank_candidate crashes with ValueError on low-balance projects due to math.log of negative number
math.log(cash_after + EPS) raises ValueError: math domain error when cash_after + EPS <= 0. This is reachable via the /v1/suggestions/generate/{project_id} API endpoint: generate_suggestions calls rank_candidate with cash_after=project.treasury_balance - 0.03 (line 69) and cash_after=project.treasury_balance - 0.50 (line 113). If a project's treasury balance is near zero (e.g., after budget deductions in RED/YELLOW mode), cash_after + EPS becomes negative, crashing the suggestion generation with an unhandled 500 error.
Suggested change:
    -         + WEIGHTS["beta_cash"] * math.log(cash_after + EPS)
    +         + WEIGHTS["beta_cash"] * math.log(max(cash_after, 0) + EPS)
Fixed — clamped cash_after to 0 before log() to prevent ValueError on low-balance projects.
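To make the failure mode and the clamp concrete, here is a small sketch of just the cash term. The values of `EPS` and `WEIGHTS["beta_cash"]` are placeholders chosen for illustration, and `cash_term` / `unclamped_cash_term` are hypothetical helper names; the real `rank_candidate` combines this term with others not shown here.

```python
import math

EPS = 1e-6                      # assumed value, not taken from the PR
WEIGHTS = {"beta_cash": 0.1}    # assumed value, not taken from the PR


def unclamped_cash_term(cash_after: float) -> float:
    """Original form: raises ValueError when cash_after + EPS <= 0."""
    return WEIGHTS["beta_cash"] * math.log(cash_after + EPS)


def cash_term(cash_after: float) -> float:
    """Clamped form: negative balances are treated as zero cash."""
    return WEIGHTS["beta_cash"] * math.log(max(cash_after, 0.0) + EPS)


# A treasury of 0.01 minus the 0.03 suggestion cost gives a negative cash_after.
cash_after = 0.01 - 0.03

try:
    unclamped_cash_term(cash_after)
except ValueError as exc:
    print("unclamped:", exc)

print("clamped:", cash_term(cash_after))
```

With the clamp, every non-positive balance produces the same large negative penalty, `beta_cash * log(EPS)`, so such candidates rank low instead of crashing the endpoint.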
- suggestion_engine: clamp cash_after to 0 before log to avoid ValueError
- budget_governor: deduct cost overruns when actual_cost > reserved amount
Co-Authored-By: Sigil Wen <sigil.w3n@gmail.com>
E2E Test Results: All Passed
Ran backend + frontend locally, seeded data, executed 23-step demo loop, then verified invariants via API and UI.
5/5 Tests Passed
Key Assertion Values
RED-mode deploy test:
…rated rubrics
- Add EvalRun model fields: eval_mode, baseline_id, baseline_scores,
candidate_scores, winner, score_delta, recommendation
- New rubric_generator service: auto-generates rubric dimensions from
production traces (autonomy receipts, deployment receipts, payments)
- New run_comparative_eval in eval_runner: simulates baseline and
candidate agents against worldlets, scores per-dimension, picks winner
- POST /v1/eval/compare endpoint: triggers comparative eval with
optional auto-generated rubric
- POST /v1/eval/runs/{id}/deploy-candidate: promotes winning candidate
- GET /v1/eval/runs: returns comparison data for comparative runs
- Frontend Eval page: side-by-side score table with per-dimension
deltas, winner badge, and Deploy Candidate button
- Demo flow updated: creates baseline trace, auto-generates rubric,
runs comparative eval (baseline 0.95 vs candidate 0.98)
- 5 new tests for comparative eval (41 total, all passing)
Co-Authored-By: Sigil Wen <sigil.w3n@gmail.com>
Comparative Eval Scoring: E2E Test Results
Ran backend + frontend locally against seeded data with demo flow, tested the comparative eval UI and API end-to-end.
Comparative Eval UI Tests (4/4 passed)
API Edge Case Tests (2/2 passed)
6/6 tests passed. No escalations.

Summary
Full Conway Agent OS implementation (Observe → Remember → Suggest → Evaluate → Govern → Deploy → Monetize → Receipt → Learn → Reward → Upgrade) with comparative eval scoring — baseline vs candidate agents are simulated against auto-generated rubrics and scored so the user can deploy the best candidate.
Comparative Eval (new in latest commit)
Eval runs now go beyond pass/fail attestations:
- rubric_generator.py: auto-generates Rubric from production traces (AutonomyReceipt, DeploymentReceipt, PaymentReceipt), deriving dimensions across capability/deployment/monetization/safety categories
- run_comparative_eval(): simulates baseline + candidate against worldlets, scores per-dimension via _simulate_agent_scores(), computes weighted average, picks winner + recommendation
- POST /v1/eval/compare: accepts {baseline_id, candidate_id, project_id}, optionally auto-generates rubric if no rubric_id provided
- POST /v1/eval/runs/{id}/deploy-candidate: promotes winning candidate to canary
- EvalRun model gains eval_mode, baseline_id, baseline_scores, candidate_scores, winner, score_delta, recommendation
- Eval.tsx: side-by-side score table with per-dimension deltas, winner badge, "Deploy Candidate" button

41 tests passing (36 original + 5 new comparative eval tests).
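The scoring step described above (per-dimension scores, a weighted average per agent, then a winner and score_delta) might look roughly like the sketch below. The function names, the dimension weights, the sample scores, the tie-breaking rule (ties keep the baseline), and the recommendation strings are all assumptions for illustration; the actual logic in run_comparative_eval() may differ.

```python
from typing import Dict


def weighted_average(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def compare(
    baseline_scores: Dict[str, float],
    candidate_scores: Dict[str, float],
    weights: Dict[str, float],
) -> dict:
    """Score both agents on the same rubric and pick a winner."""
    baseline_total = weighted_average(baseline_scores, weights)
    candidate_total = weighted_average(candidate_scores, weights)
    score_delta = candidate_total - baseline_total
    # Assumption: the candidate must strictly beat the baseline to win.
    winner = "candidate" if score_delta > 0 else "baseline"
    recommendation = "deploy_candidate" if winner == "candidate" else "keep_baseline"
    return {
        "baseline_total": baseline_total,
        "candidate_total": candidate_total,
        "score_delta": score_delta,
        "winner": winner,
        "recommendation": recommendation,
        # Per-dimension deltas, as shown in the side-by-side table.
        "dimension_deltas": {
            dim: candidate_scores[dim] - baseline_scores[dim] for dim in weights
        },
    }


# Hypothetical rubric with two equally weighted dimensions.
weights = {"capability": 1.0, "safety": 1.0}
result = compare(
    baseline_scores={"capability": 0.9, "safety": 1.0},
    candidate_scores={"capability": 1.0, "safety": 1.0},
    weights=weights,
)
print(result["winner"], round(result["score_delta"], 3))
```

Letting ties go to the baseline is a conservative choice: the candidate is only promoted when it measurably improves on the rubric.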
Link to Devin session: https://app.devin.ai/sessions/8fda63f035ef4651a663b6d48a53b3de
Requested by: @Sigil-Wen