Build failure fix verification with Agent-as-a-judge#5793

Open
evgenyrp wants to merge 4 commits into mozilla:master from evgenyrp:build_repair_llmaaj
Conversation

@evgenyrp
Contributor

This approach uses the same Claude Agents SDK to verify the analysis and fix against the ground truth, producing LLM-as-a-judge style metrics.

The Weave results of the full run show tracing for the three chat stages: analysis, fix, and verification.

I chose to implement it as another optional stage of the fixing pipeline (rather than as a separate scorer) because it simplifies the code and allows reusing the agent's own infrastructure: the SDK, sandboxing, config, logging, Weave tracing, retries, etc.
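As a rough illustration of this design (the stage functions are stubbed and all names are hypothetical, not the actual bugbug API), the judge can slot in as an optional third stage that shares the earlier stages' plumbing:

```python
# Illustrative sketch only: stage functions are stubbed and names are not
# the actual bugbug API.
import asyncio


async def analyze(failure):
    # stage 1: the agent analyzes the build failure
    return f"analysis of {failure}"


async def fix(failure, analysis):
    # stage 2: the agent produces a diff based on the analysis
    return "diff --git a/mod.py b/mod.py"


async def verify(failure, diff):
    # stage 3: the judge scores the fix against ground truth, reusing the
    # same SDK/retry/tracing infrastructure as the earlier stages
    return {"fix_matches_ground_truth": True, "judge_cost_usd": 0.05}


async def run_pipeline(failure, enable_verify=True):
    analysis = await analyze(failure)
    diff = await fix(failure, analysis)
    verify_result = {}
    if enable_verify and (analysis or diff):  # skip judging empty results
        verify_result = await verify(failure, diff)
    return {"analysis": analysis, "diff": diff, "verify": verify_result}


result = asyncio.run(run_pipeline("bug-123"))
```

Because the judge output lands under `"verify"` in the model output, a downstream scorer can aggregate it without re-running anything.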

An agentic judge is not only simple to implement; it can also be extended in the future to accept more inputs and use more tools to evaluate fixes more thoroughly.

The downside of using the same agent is that it might be biased toward its own solution, even though the judge doesn't share context with the fixing stages. We could try a different model for the judge in the future, such as an OpenAI model.

Evaluation results

  • The analysis correctness rate is very high at 0.99 (though likely biased).
  • The fix matches the ground truth in 52% of cases.
  • The estimated probability of a fix being accepted in code review is higher, 0.62, because the judge notes that some fixes can be considered correct even though they don't match the ground truth.
  • Almost all fixes (97%) pass the local build.
  • The total benchmark cost, including the judge, is $80.
  • 4 of the 85 examples failed despite retries due to reliability issues, likely related to the parallelism of 8 I used for the run.
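For reference, the rates above are plain means over the scored rows. A toy example with invented values (only the field names mirror the scorer in this PR):

```python
# Toy aggregation with invented data; only the field names mirror the PR.
scored = [
    {"analysis_correct": True, "fix_matches_ground_truth": True,
     "fix_acceptance_probability": 0.9},
    {"analysis_correct": True, "fix_matches_ground_truth": False,
     "fix_acceptance_probability": 0.4},
]
n = len(scored)
summary = {
    # `is True` treats missing/None as a miss, matching the scorer's style
    "analysis_correct_rate": sum(r.get("analysis_correct") is True for r in scored) / n,
    "fix_match_rate": sum(r.get("fix_matches_ground_truth") is True for r in scored) / n,
    "avg_fix_acceptance_probability": sum(r["fix_acceptance_probability"] for r in scored) / n,
}
```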

@evgenyrp evgenyrp requested a review from suhaibmujahid March 13, 2026 21:45
Comment on lines +187 to +216
"avg_analysis_quality": sum(r["analysis_quality"] for r in scored) / n
if n
else 0,
"analysis_correct_rate": sum(
r.get("analysis_correct") is True for r in scored
)
/ n
if n
else 0,
"avg_fix_quality": sum(r["fix_quality"] for r in scored) / n if n else 0,
"fix_match_rate": sum(
r.get("fix_matches_ground_truth") is True for r in scored
)
/ n
if n
else 0,
"avg_fix_acceptance_probability": sum(
r["fix_acceptance_probability"] for r in scored
)
/ n
if n
else 0,
"total_judge_cost_usd": sum(r.get("judge_cost_usd", 0) for r in score_rows),
"num_scored": n,
}
if self.num_trials > 1:
summary.update(
_pass_at_k(score_rows, self.num_trials, "fix_matches_ground_truth")
)
logger.info(f"LLMFixMatching summary: {summary}")
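I would guess `_pass_at_k` groups trial rows by example and checks whether any of the k trials matched. A hypothetical sketch of that behavior (the actual implementation in the PR may differ):

```python
# Hypothetical sketch of pass@k; a guess at _pass_at_k's behavior, not the
# PR's actual implementation. `example_id` is an assumed row field.
from collections import defaultdict


def pass_at_k(rows, k, key):
    """Fraction of examples where any of the first k trials is truthy for `key`."""
    by_example = defaultdict(list)
    for r in rows:
        by_example[r["example_id"]].append(bool(r.get(key)))
    if not by_example:
        return {f"{key}_pass@{k}": 0}
    passed = sum(any(trials[:k]) for trials in by_example.values())
    return {f"{key}_pass@{k}": passed / len(by_example)}


rows = [
    {"example_id": 1, "fix_matches_ground_truth": True},
    {"example_id": 1, "fix_matches_ground_truth": False},
    {"example_id": 2, "fix_matches_ground_truth": False},
    {"example_id": 2, "fix_matches_ground_truth": False},
]
```

With two trials per example, example 1 passes (one matching trial) and example 2 does not, giving pass@2 = 0.5.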
Member
This is a bit hard to follow; storing the values in intermediate variables could make it more readable.
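One possible shape for that refactor (a sketch, not a patch against the actual file): hoist the repeated sum-and-divide pattern into small helpers so each summary key reads as a single expression:

```python
# Sketch of the suggested refactor; `summarize`, `scored`, and `score_rows`
# are illustrative names, not the actual scorer's interface.
def summarize(scored, score_rows):
    n = len(scored)

    def rate(key):
        # fraction of rows where the judge answered exactly True
        return sum(r.get(key) is True for r in scored) / n if n else 0

    def avg(key):
        # mean of a numeric judge field over all scored rows
        return sum(r.get(key, 0) for r in scored) / n if n else 0

    return {
        "avg_analysis_quality": avg("analysis_quality"),
        "analysis_correct_rate": rate("analysis_correct"),
        "avg_fix_quality": avg("fix_quality"),
        "fix_match_rate": rate("fix_matches_ground_truth"),
        "avg_fix_acceptance_probability": avg("fix_acceptance_probability"),
        "total_judge_cost_usd": sum(r.get("judge_cost_usd", 0) for r in score_rows),
        "num_scored": n,
    }
```

The nested conditional expressions collapse into two helpers, and the `if n else 0` guard appears once per helper instead of once per key.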

Copilot AI left a comment
Pull request overview

Adds an optional “verification/judge” stage to the build-repair evaluation pipeline, using the same Claude Agents SDK to review the agent’s analysis+fix against ground-truth fix commits and to emit LLM-as-a-judge metrics into Weave evaluation outputs.

Changes:

  • Extend the evaluation runner to pass ground-truth fix commits, run a new verification stage, and record verify results in the model output.
  • Implement LLMFixMatchingScorer to aggregate verification outputs into summary metrics (quality, match rate, acceptance probability, cost).
  • Add verification prompt, model/config knobs, and agent-side verification support (including retries and usage accounting).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Summary per file:

  • scripts/build_repair_eval.py: Adds ground-truth fix commits input, runs verification stage, wires judge scorer, sets Weave log level env.
  • bugbug/tools/build_repair/scorer.py: Implements judge-metrics extraction + summarization from output["verify"].
  • bugbug/tools/build_repair/prompts.py: Updates analysis/fix templates to use worktree paths; adds verification prompt template.
  • bugbug/tools/build_repair/config.py: Introduces verify model + verify-only allowed tool list.
  • bugbug/tools/build_repair/agent.py: Adds verification models/schemas, stage retry logic, and verify() implementation that produces structured judge output.
Comments suppressed due to low confidence (1)

bugbug/tools/build_repair/agent.py:70

  • AgentResponse now inherits from UsageStats but also redeclares the same usage fields (cost_usd, num_turns, token counts). This duplication is easy to get out of sync and makes the schema harder to reason about; prefer inheriting and not redeclaring, or remove the base class if you want explicit fields here.
class AgentResponse(UsageStats):
    """Output from a build repair run, including analysis, diff, cost, and build results."""

    summary: str = Field(default="")
    analysis: str = Field(default="")


) -> VerifyResponse:
out_dir = worktree_path / "repair_agent" / "out" / str(failure.bug_id)
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "agent_fix.diff").write_text(agent_diff)
Comment on lines 168 to 173
transcript: list[dict] = []
cost = 0.0
turns = 0
result_data: dict = {}
usage: dict = {}

Comment on lines +290 to +298
if result.analysis or result.summary:
ground_truth = GroundTruth(gh_fix_commits=gh_fix_commits)
verify_result = await self.tool.verify(
failure,
result.diff,
ground_truth,
worktree_path,
on_message,
)
)

os.environ["WEAVE_PARALLELISM"] = str(args.parallelism)
os.environ["WEAVE_LOG_LEVEL"] = "INFO" if args.verbose else "WARNING"
Comment on lines +152 to +155
"analysis_quality": None,
"fix_matches_ground_truth": None,
"fix_quality": None,
"fix_acceptance_probability": None,
