Build failure fix verification with Agent-as-a-judge#5793

Open
evgenyrp wants to merge 4 commits into mozilla:master from evgenyrp:build_repair_llmaaj
Conversation

@evgenyrp
Contributor

This approach uses the same Claude Agents SDK to verify the analysis and fix against the ground truth, producing LLM-as-a-judge style metrics.

The Weave results of the full run show tracing for the three chat stages: analysis, fix, and verification.

I chose to implement it as another optional stage of the fixing pipeline (rather than as a separate scorer) because it simplifies the code and allows reusing the agent's own infrastructure: the SDK, sandboxing, config, logging, Weave tracing, retries, etc.
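As a rough illustration of this design (the stage functions are stubbed and all names are hypothetical, not the actual bugbug API), the judge can slot in as an optional third stage that shares the earlier stages' plumbing:

```python
# Illustrative sketch only: stage functions are stubbed and names are not
# the actual bugbug API.
import asyncio


async def analyze(failure):
    # stage 1: the agent analyzes the build failure
    return f"analysis of {failure}"


async def fix(failure, analysis):
    # stage 2: the agent produces a diff based on the analysis
    return "diff --git a/mod.py b/mod.py"


async def verify(failure, diff):
    # stage 3: the judge scores the fix against ground truth, reusing the
    # same SDK/retry/tracing infrastructure as the earlier stages
    return {"fix_matches_ground_truth": True, "judge_cost_usd": 0.05}


async def run_pipeline(failure, enable_verify=True):
    analysis = await analyze(failure)
    diff = await fix(failure, analysis)
    verify_result = {}
    if enable_verify and (analysis or diff):  # skip judging empty results
        verify_result = await verify(failure, diff)
    return {"analysis": analysis, "diff": diff, "verify": verify_result}


result = asyncio.run(run_pipeline("bug-123"))
```

Because the judge output lands under `"verify"` in the model output, a downstream scorer can aggregate it without re-running anything.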

An agentic judge is not only simple to implement; it can also be extended in the future to accept more inputs and use more tools to evaluate fixes more thoroughly.

The downside of using the same agent is that it might be biased toward its own solution, even though the judge doesn't share context with the fixing stages. We could try a different model for the judge in the future, such as an OpenAI model.

Evaluation results

  • The analysis correctness rate is very high at 0.99 (though likely biased).
  • The fix matches the ground truth in 52% of cases.
  • The estimated probability of a fix being accepted in code review is higher, 0.62, because the judge notes that some fixes can be considered correct even though they don't match the ground truth.
  • Almost all fixes (97%) pass the local build.
  • The total benchmark cost, including the judge, is $80.
  • 4 of the 85 examples failed despite retries due to reliability issues, likely related to the parallelism of 8 I used for the run.
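For reference, the rates above are plain means over the scored rows. A toy example with invented values (only the field names mirror the scorer in this PR):

```python
# Toy aggregation with invented data; only the field names mirror the PR.
scored = [
    {"analysis_correct": True, "fix_matches_ground_truth": True,
     "fix_acceptance_probability": 0.9},
    {"analysis_correct": True, "fix_matches_ground_truth": False,
     "fix_acceptance_probability": 0.4},
]
n = len(scored)
summary = {
    # `is True` treats missing/None as a miss, matching the scorer's style
    "analysis_correct_rate": sum(r.get("analysis_correct") is True for r in scored) / n,
    "fix_match_rate": sum(r.get("fix_matches_ground_truth") is True for r in scored) / n,
    "avg_fix_acceptance_probability": sum(r["fix_acceptance_probability"] for r in scored) / n,
}
```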

@evgenyrp evgenyrp requested a review from suhaibmujahid March 13, 2026 21:45
Comment on lines +187 to +216
"avg_analysis_quality": sum(r["analysis_quality"] for r in scored) / n
if n
else 0,
"analysis_correct_rate": sum(
r.get("analysis_correct") is True for r in scored
)
/ n
if n
else 0,
"avg_fix_quality": sum(r["fix_quality"] for r in scored) / n if n else 0,
"fix_match_rate": sum(
r.get("fix_matches_ground_truth") is True for r in scored
)
/ n
if n
else 0,
"avg_fix_acceptance_probability": sum(
r["fix_acceptance_probability"] for r in scored
)
/ n
if n
else 0,
"total_judge_cost_usd": sum(r.get("judge_cost_usd", 0) for r in score_rows),
"num_scored": n,
}
if self.num_trials > 1:
summary.update(
_pass_at_k(score_rows, self.num_trials, "fix_matches_ground_truth")
)
logger.info(f"LLMFixMatching summary: {summary}")
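I would guess `_pass_at_k` groups trial rows by example and checks whether any of the k trials matched. A hypothetical sketch of that behavior (the actual implementation in the PR may differ):

```python
# Hypothetical sketch of pass@k; a guess at _pass_at_k's behavior, not the
# PR's actual implementation. `example_id` is an assumed row field.
from collections import defaultdict


def pass_at_k(rows, k, key):
    """Fraction of examples where any of the first k trials is truthy for `key`."""
    by_example = defaultdict(list)
    for r in rows:
        by_example[r["example_id"]].append(bool(r.get(key)))
    if not by_example:
        return {f"{key}_pass@{k}": 0}
    passed = sum(any(trials[:k]) for trials in by_example.values())
    return {f"{key}_pass@{k}": passed / len(by_example)}


rows = [
    {"example_id": 1, "fix_matches_ground_truth": True},
    {"example_id": 1, "fix_matches_ground_truth": False},
    {"example_id": 2, "fix_matches_ground_truth": False},
    {"example_id": 2, "fix_matches_ground_truth": False},
]
```

With two trials per example, example 1 passes (one matching trial) and example 2 does not, giving pass@2 = 0.5.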
Member
This is a bit hard to follow; storing the values in intermediate variables could make it more readable.
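One possible shape for that refactor (a sketch, not a patch against the actual file): hoist the repeated sum-and-divide pattern into small helpers so each summary key reads as a single expression:

```python
# Sketch of the suggested refactor; `summarize`, `scored`, and `score_rows`
# are illustrative names, not the actual scorer's interface.
def summarize(scored, score_rows):
    n = len(scored)

    def rate(key):
        # fraction of rows where the judge answered exactly True
        return sum(r.get(key) is True for r in scored) / n if n else 0

    def avg(key):
        # mean of a numeric judge field over all scored rows
        return sum(r.get(key, 0) for r in scored) / n if n else 0

    return {
        "avg_analysis_quality": avg("analysis_quality"),
        "analysis_correct_rate": rate("analysis_correct"),
        "avg_fix_quality": avg("fix_quality"),
        "fix_match_rate": rate("fix_matches_ground_truth"),
        "avg_fix_acceptance_probability": avg("fix_acceptance_probability"),
        "total_judge_cost_usd": sum(r.get("judge_cost_usd", 0) for r in score_rows),
        "num_scored": n,
    }
```

The nested conditional expressions collapse into two helpers, and the `if n else 0` guard appears once per helper instead of once per key.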

Copilot AI left a comment
Pull request overview

Adds an optional “verification/judge” stage to the build-repair evaluation pipeline, using the same Claude Agents SDK to review the agent’s analysis+fix against ground-truth fix commits and to emit LLM-as-a-judge metrics into Weave evaluation outputs.

Changes:

  • Extend the evaluation runner to pass ground-truth fix commits, run a new verification stage, and record verify results in the model output.
  • Implement LLMFixMatchingScorer to aggregate verification outputs into summary metrics (quality, match rate, acceptance probability, cost).
  • Add verification prompt, model/config knobs, and agent-side verification support (including retries and usage accounting).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Summary per file:

  • scripts/build_repair_eval.py: Adds ground-truth fix commits input, runs verification stage, wires judge scorer, sets Weave log level env.
  • bugbug/tools/build_repair/scorer.py: Implements judge-metrics extraction + summarization from output["verify"].
  • bugbug/tools/build_repair/prompts.py: Updates analysis/fix templates to use worktree paths; adds verification prompt template.
  • bugbug/tools/build_repair/config.py: Introduces verify model + verify-only allowed tool list.
  • bugbug/tools/build_repair/agent.py: Adds verification models/schemas, stage retry logic, and verify() implementation that produces structured judge output.
Comments suppressed due to low confidence (1)

bugbug/tools/build_repair/agent.py:70

  • AgentResponse now inherits from UsageStats but also redeclares the same usage fields (cost_usd, num_turns, token counts). This duplication is easy to get out of sync and makes the schema harder to reason about; prefer inheriting and not redeclaring, or remove the base class if you want explicit fields here.
class AgentResponse(UsageStats):
    """Output from a build repair run, including analysis, diff, cost, and build results."""

    summary: str = Field(default="")
    analysis: str = Field(default="")


) -> VerifyResponse:
out_dir = worktree_path / "repair_agent" / "out" / str(failure.bug_id)
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "agent_fix.diff").write_text(agent_diff)
Comment on lines 168 to 173
transcript: list[dict] = []
cost = 0.0
turns = 0
result_data: dict = {}
usage: dict = {}

Comment on lines +290 to +298
if result.analysis or result.summary:
ground_truth = GroundTruth(gh_fix_commits=gh_fix_commits)
verify_result = await self.tool.verify(
failure,
result.diff,
ground_truth,
worktree_path,
on_message,
)
)

os.environ["WEAVE_PARALLELISM"] = str(args.parallelism)
os.environ["WEAVE_LOG_LEVEL"] = "INFO" if args.verbose else "WARNING"
Comment on lines +152 to +155
"analysis_quality": None,
"fix_matches_ground_truth": None,
"fix_quality": None,
"fix_acceptance_probability": None,
