Add MCP and skill reliability report #357
Open
ozymandiashh wants to merge 1 commit into
Summary
Adds an MCP and skill reliability report to `codeburn optimize`.
Why
Existing optimize findings can show broad MCP waste, context bloat, and low-worth sessions, but they do not answer a capability-level reliability question:
This is useful for MCP and skill tuning because the right action is often not removal. A retry-heavy skill may need clearer instructions. A retry-heavy MCP server may need narrower project scope, a smaller tool set, or better usage guidance. The finding is intentionally framed as correlation, not causation, so users inspect the sessions before changing config.
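One way to picture the capability-level question is a per-capability retry count over edit turns. The following is a minimal sketch, not the detector added in this PR: the `Turn` and `Call` shapes, the `isEdit` and `retries` fields, and every function name are hypothetical. Only `call.mcpTools`, `call.skills`, and the `mcp__ci__run` to `ci` mapping come from this description.

```typescript
// Hypothetical shapes for illustration; the real parsed-turn types may differ.
interface Call {
  mcpTools?: string[]; // e.g. ["mcp__ci__run"]
  skills?: string[];   // e.g. ["reviewer"]
}

interface Turn {
  isEdit: boolean; // assumed flag: the turn edited files
  retries: number; // assumed retry count for the turn
  calls: Call[];
}

interface RetryStats {
  editTurns: number;
  retryTurns: number;
}

// "mcp__ci__run" -> "ci". Returns null for names that do not look like
// mcp__<server>__<tool> (assumed naming pattern).
function mcpServerFromTool(name: string): string | null {
  const parts = name.split("__");
  return parts.length >= 3 && parts[0] === "mcp" && parts[1] !== "" ? parts[1] : null;
}

// Capability keys used by one turn, deduplicated so a turn counts once per capability.
function capabilitiesOf(turn: Turn): string[] {
  const keys: string[] = [];
  const add = (key: string): void => {
    if (keys.indexOf(key) === -1) keys.push(key);
  };
  for (const call of turn.calls) {
    for (const tool of call.mcpTools ?? []) {
      const server = mcpServerFromTool(tool);
      if (server !== null) add("mcp:" + server);
    }
    for (const skill of call.skills ?? []) add("skill:" + skill);
  }
  return keys;
}

// Count edit turns and retrying edit turns per capability; read-only turns are skipped.
function retryStatsByCapability(turns: Turn[]): Record<string, RetryStats> {
  const stats: Record<string, RetryStats> = {};
  for (const turn of turns) {
    if (!turn.isEdit) continue;
    for (const key of capabilitiesOf(turn)) {
      if (!stats[key]) stats[key] = { editTurns: 0, retryTurns: 0 };
      stats[key].editTurns += 1;
      if (turn.retries > 0) stats[key].retryTurns += 1;
    }
  }
  return stats;
}
```

A high `retryTurns / editTurns` ratio marks a capability for inspection; it does not show that the capability caused the retries.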
For example:
- Skill `reviewer` appears in 5 edit turns, and 3 of those edit turns need retries.
- `mcp__ci__run` maps to MCP server `ci`, and the same 3/5 edit-turn retry pattern appears there.
What changed
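- Adds `detectCapabilityReliability()` to `src/optimize.ts`.
- MCP servers are read from `call.mcpTools`, normalized from names like `mcp__ci__run` to server `ci`.
- Skills are read from `call.skills`.
- `turn.subCategory` is deliberately not treated as skill evidence, avoiding legacy or classifier-derived false labels.
- Retry-heavy turns shared by several capabilities are counted once in `tokensSaved`.
- Adds tests in `tests/optimize.test.ts`.
Validation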
I validated the behavior by running the real `detectCapabilityReliability()` export against controlled `ProjectSummary` fixtures via `npx tsx --eval`. This exercises the actual detector path and prints the expected edge cases.

    {
      "skill_retry_report": {
        "title": "1 skill correlates with retry-heavy edits",
        "tokensSaved": 1500,
        "expectedTokensSaved": 1500,
        "includesSkill": true,
        "proof": "5 edit turns using Skill reviewer; 3 retry-heavy turns at 1,000 effective tokens each; shared recovery ceiling is 50%, so 3*1000*0.5 = 1,500 tokens"
      },
      "mcp_retry_report": {
        "title": "1 MCP server correlates with retry-heavy edits",
        "tokensSaved": 1500,
        "expectedTokensSaved": 1500,
        "includesMcpServer": true,
        "proof": "same retry pattern attributed from mcp__ci__run to MCP server ci"
      },
      "shared_mcp_skill_turn_cap": {
        "title": "2 MCP/skill capabilities correlate with retry-heavy edits",
        "tokensSaved": 1500,
        "expectedTokensSaved": 1500,
        "includesBoth": true,
        "proof": "MCP server ci and Skill reviewer share the same 3 retry-heavy turns; tokensSaved stays 1,500 instead of doubling to 3,000"
      },
      "healthy_guard": {
        "finding": null,
        "proof": "1/5 retry-heavy edit turns is below the 50% retry-rate threshold"
      },
      "subcategory_guard": {
        "finding": null,
        "proof": "turn.subCategory without actual call.skills metadata does not create a skill reliability finding"
      },
      "readonly_guard": {
        "finding": null,
        "proof": "read-only retry turns are ignored because the detector is scoped to edit reliability"
      }
    }

What this proves:
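- The skill and MCP retry reports each produce the expected `1,500` token estimate.
- Turns shared by an MCP server and a skill keep `tokensSaved` at `1,500`, not the inflated `3,000`.
- `turn.subCategory` alone does not create a skill finding.
Supporting checks: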
- `./node_modules/.bin/tsc --noEmit --pretty false`
- `npx vitest run tests/optimize.test.ts`: 82 tests passed
- `npm run build`
- `npm test -- --run`: 62 files / 877 tests passed
- `git diff --check`
PASS PASS
Notes
`main` and uses the existing parsed turn metadata.
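For reviewers, the arithmetic behind the validation numbers can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in `src/optimize.ts`: the `EditTurn` shape, the constant names, and `estimateTokensSaved` are hypothetical, and treating the 50% retry-rate threshold as inclusive is an assumption. The 50% recovery ceiling and the count-once rule for shared turns are taken from the validation output.

```typescript
// Hypothetical, simplified input: one entry per edit turn (read-only turns already excluded).
interface EditTurn {
  capabilities: string[];  // deduplicated keys, e.g. ["mcp:ci", "skill:reviewer"]
  retryHeavy: boolean;     // whether the turn needed retries
  effectiveTokens: number; // effective tokens spent on the turn
}

const RETRY_RATE_THRESHOLD = 0.5; // from the validation output; inclusive comparison is assumed
const RECOVERY_CEILING = 0.5;     // "shared recovery ceiling is 50%"

function estimateTokensSaved(turns: EditTurn[]): { flagged: string[]; tokensSaved: number } {
  // Per-capability totals of edit turns and retry-heavy edit turns.
  const counts: Record<string, { total: number; heavy: number }> = {};
  for (const turn of turns) {
    for (const cap of turn.capabilities) {
      if (!counts[cap]) counts[cap] = { total: 0, heavy: 0 };
      counts[cap].total += 1;
      if (turn.retryHeavy) counts[cap].heavy += 1;
    }
  }

  // Capabilities whose retry rate reaches the threshold.
  const flagged = Object.keys(counts)
    .filter((cap) => counts[cap].heavy / counts[cap].total >= RETRY_RATE_THRESHOLD)
    .sort();

  // Each retry-heavy turn is counted once, even when several flagged capabilities share it.
  let retryTokens = 0;
  for (const turn of turns) {
    const usesFlagged = turn.capabilities.some((cap) => flagged.indexOf(cap) !== -1);
    if (turn.retryHeavy && usesFlagged) retryTokens += turn.effectiveTokens;
  }

  return { flagged, tokensSaved: Math.round(retryTokens * RECOVERY_CEILING) };
}

// Shared-turn case from the validation output: 3 of 5 edit turns are retry-heavy
// at 1,000 effective tokens each, and both capabilities appear on every turn.
const shared: EditTurn[] = [true, true, true, false, false].map((retryHeavy) => ({
  capabilities: ["mcp:ci", "skill:reviewer"],
  retryHeavy,
  effectiveTokens: 1000,
}));
console.log(estimateTokensSaved(shared).tokensSaved); // 3 * 1000 * 0.5 = 1500
```

Summing per capability instead of per turn would give 3,000 in the shared case, which is the doubling the cap is meant to prevent.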