Build failure fix verification with Agent-as-a-judge #5793
Open
evgenyrp wants to merge 4 commits into mozilla:master from
Conversation
suhaibmujahid approved these changes Mar 17, 2026
Comment on lines +187 to +216
```python
    "avg_analysis_quality": sum(r["analysis_quality"] for r in scored) / n
    if n
    else 0,
    "analysis_correct_rate": sum(
        r.get("analysis_correct") is True for r in scored
    )
    / n
    if n
    else 0,
    "avg_fix_quality": sum(r["fix_quality"] for r in scored) / n if n else 0,
    "fix_match_rate": sum(
        r.get("fix_matches_ground_truth") is True for r in scored
    )
    / n
    if n
    else 0,
    "avg_fix_acceptance_probability": sum(
        r["fix_acceptance_probability"] for r in scored
    )
    / n
    if n
    else 0,
    "total_judge_cost_usd": sum(r.get("judge_cost_usd", 0) for r in score_rows),
    "num_scored": n,
}
if self.num_trials > 1:
    summary.update(
        _pass_at_k(score_rows, self.num_trials, "fix_matches_ground_truth")
    )
logger.info(f"LLMFixMatching summary: {summary}")
```
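`_pass_at_k` itself is not shown in this diff. As a point of reference, a plausible sketch of what such a helper computes (an assumed implementation; the grouping key `"bug_id"` is a guess, not taken from the PR) is:

```python
from collections import defaultdict


def pass_at_k(score_rows: list[dict], num_trials: int, field: str) -> dict:
    """Hypothetical sketch: fraction of problems where at least one of the
    num_trials attempts has `field` set to True. Assumes each row carries
    a problem identifier under "bug_id" (an assumption, not from the diff)."""
    by_problem: dict = defaultdict(list)
    for row in score_rows:
        by_problem[row["bug_id"]].append(row.get(field) is True)
    if not by_problem:
        return {f"pass@{num_trials}": 0.0}
    passed = sum(1 for results in by_problem.values() if any(results))
    return {f"pass@{num_trials}": passed / len(by_problem)}
```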
This is a bit hard to follow; storing the values in intermediate variables could make it more readable.
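One way to apply this suggestion is to factor the repeated "average or 0" and "rate of True values" patterns into small helpers. A sketch of that direction (the `summarize` wrapper and helper names are hypothetical, not from the PR):

```python
def summarize(scored: list[dict], n: int) -> dict:
    # Hypothetical helpers (not in the PR) that factor out the repeated
    # "average or 0" and "rate of exactly-True values" expressions.
    def mean(key: str) -> float:
        return sum(r[key] for r in scored) / n if n else 0

    def true_rate(key: str) -> float:
        return sum(r.get(key) is True for r in scored) / n if n else 0

    return {
        "avg_analysis_quality": mean("analysis_quality"),
        "analysis_correct_rate": true_rate("analysis_correct"),
        "avg_fix_quality": mean("fix_quality"),
        "fix_match_rate": true_rate("fix_matches_ground_truth"),
        "avg_fix_acceptance_probability": mean("fix_acceptance_probability"),
        "num_scored": n,
    }
```

Each summary entry then reads as a single named computation instead of a multi-line conditional expression.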
Pull request overview
Adds an optional “verification/judge” stage to the build-repair evaluation pipeline, using the same Claude Agents SDK to review the agent’s analysis+fix against ground-truth fix commits and to emit LLM-as-a-judge metrics into Weave evaluation outputs.
Changes:
- Extend the evaluation runner to pass ground-truth fix commits, run a new verification stage, and record `verify` results in the model output.
- Implement `LLMFixMatchingScorer` to aggregate verification outputs into summary metrics (quality, match rate, acceptance probability, cost).
- Add verification prompt, model/config knobs, and agent-side verification support (including retries and usage accounting).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| scripts/build_repair_eval.py | Adds ground-truth fix commits input, runs verification stage, wires judge scorer, sets Weave log level env. |
| bugbug/tools/build_repair/scorer.py | Implements judge-metrics extraction + summarization from output["verify"]. |
| bugbug/tools/build_repair/prompts.py | Updates analysis/fix templates to use worktree paths; adds verification prompt template. |
| bugbug/tools/build_repair/config.py | Introduces verify model + verify-only allowed tool list. |
| bugbug/tools/build_repair/agent.py | Adds verification models/schemas, stage retry logic, and verify() implementation that produces structured judge output. |
Comments suppressed due to low confidence (1)
bugbug/tools/build_repair/agent.py:70
`AgentResponse` now inherits from `UsageStats` but also redeclares the same usage fields (`cost_usd`, `num_turns`, token counts). This duplication is easy to get out of sync and makes the schema harder to reason about; prefer inheriting without redeclaring, or remove the base class if you want explicit fields here.
```python
class AgentResponse(UsageStats):
    """Output from a build repair run, including analysis, diff, cost, and build results."""

    summary: str = Field(default="")
    analysis: str = Field(default="")
```
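The suggested fix is to declare the usage fields once on the base class and let the subclass inherit them. A minimal sketch using dataclasses for illustration (the PR uses pydantic models; any field beyond `cost_usd` and `num_turns` here is an assumption):

```python
from dataclasses import dataclass


@dataclass
class UsageStats:
    # Usage fields declared once, on the base class only.
    cost_usd: float = 0.0
    num_turns: int = 0


@dataclass
class AgentResponse(UsageStats):
    """Output from a build repair run. Inherits the usage fields
    from UsageStats instead of redeclaring them."""
    summary: str = ""
    analysis: str = ""
```

With a single source of truth for the usage fields, adding a token counter later means touching only `UsageStats`.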
```python
) -> VerifyResponse:
    out_dir = worktree_path / "repair_agent" / "out" / str(failure.bug_id)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "agent_fix.diff").write_text(agent_diff)
```
Comment on lines 168 to 173
```python
transcript: list[dict] = []
cost = 0.0
turns = 0
result_data: dict = {}
usage: dict = {}
```
Comment on lines +290 to +295
```python
if result.analysis or result.summary:
    ground_truth = GroundTruth(gh_fix_commits=gh_fix_commits)
    verify_result = await self.tool.verify(
        failure,
        result.diff,
        ground_truth,
```
Comment on lines +290 to +298
```python
if result.analysis or result.summary:
    ground_truth = GroundTruth(gh_fix_commits=gh_fix_commits)
    verify_result = await self.tool.verify(
        failure,
        result.diff,
        ground_truth,
        worktree_path,
        on_message,
    )
```
```python
os.environ["WEAVE_PARALLELISM"] = str(args.parallelism)
os.environ["WEAVE_LOG_LEVEL"] = "INFO" if args.verbose else "WARNING"
```
Comment on lines +152 to +155
```python
"analysis_quality": None,
"fix_matches_ground_truth": None,
"fix_quality": None,
"fix_acceptance_probability": None,
```
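These `None` defaults correspond to the per-row judge metrics that the scorer later aggregates. A minimal sketch of the shape such a judge output might take (field types and the `scored` helper are assumptions inferred from the metric names; the actual `VerifyResponse` schema lives in `agent.py`):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeScores:
    # Hypothetical per-row judge output, inferred from the metric keys in
    # this diff. None means the verification stage did not run or failed.
    analysis_quality: Optional[float] = None
    fix_matches_ground_truth: Optional[bool] = None
    fix_quality: Optional[float] = None
    fix_acceptance_probability: Optional[float] = None  # assumed 0.0-1.0

    def scored(self) -> bool:
        """True once the judge stage actually produced scores."""
        return self.analysis_quality is not None
```

Defaulting to `None` rather than `0` lets the scorer distinguish "judge gave a low score" from "judge never ran".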
This approach uses the same Claude Agents SDK to verify the analysis and fix against the ground truth and produces LLM-as-a-judge style metrics.
Weave results of the full run.
Tracing shows 3 stages of chats: analysis, fix and verification.
I chose to implement it as another optional stage of the fixing pipeline (rather than a separate scorer) because it simplifies the code and lets it reuse the agent's existing infrastructure: the SDK, sandboxing, config, logging, Weave tracing, retries, etc.
An agentic judge is not only simple to implement; it can also be extended in the future to accept more input and use more tools to evaluate the fix more thoroughly.
The downside of using the same agent is that it might be biased toward its own solution, even though the judge doesn't share context with the fixing stages. We can try a different judge model in the future, such as an OpenAI model.
Evaluation results