feat: add empirical-tuning skill for pre-crystallize hardening #4
Open
Ageoh wants to merge 1 commit into Shmayro:main
Introduces `empirical-tuning`: a blind-runner loop that dispatches independent subagents to execute a target skill, harvests qualitative signals (unclear points, discretion filled in) and quantitative signals (checklist pass rate, step count, duration), and refines the skill's prompt across iterations until it converges.

The skill is adapted from mizchi's empirical-prompt-tuning (https://github.com/mizchi/chezmoi-dotfiles/blob/main/dot_claude/skills/empirical-prompt-tuning/SKILL.md) with the following CCC-edition extensions:

- Explicit **Qualitative convergence criteria**: location convergence (two or more evaluators flagging the same section or requirement number) is treated as the primary convergence signal (confirmed ambiguity), while severity variance within ±15pt is accepted as evaluator noise. This makes the "when to stop iterating" decision reproducible across blind-runner runs.
- Integration notes positioning empirical-tuning as a pre-`scoring` harness complementing singularity-claude's self-run scoring loop.

Dogfooded on a real skill (vps-monitor) for 3 iterations; the location-convergence criterion identified two confirmed ambiguities that self-run scoring had not surfaced.
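The loop described above reads roughly like the following sketch. All helper names, signal fields, and the stubbed dispatch/refine logic are hypothetical stand-ins for actual subagent dispatch, not the skill's real interface:

```python
# Hypothetical sketch of the blind-runner tuning loop. `run_blind`,
# `refine_prompt`, and the convergence rule are illustrative stubs.
from dataclasses import dataclass


@dataclass
class RunSignals:
    unclear_points: list[str]   # qualitative: sections the runner flagged
    checklist_pass_rate: float  # quantitative signals harvested per run
    step_count: int
    duration_s: float


def run_blind(prompt: str, scenario: str) -> RunSignals:
    # Stand-in for dispatching one independent blind-runner subagent.
    flagged = [] if "step 3:" in prompt else ["step-3"]
    return RunSignals(flagged, 1.0 if not flagged else 0.6, 7, 12.0)


def refine_prompt(prompt: str, signals: list[RunSignals]) -> str:
    # Stand-in: resolve the most-flagged ambiguity between iterations.
    if any(s.unclear_points for s in signals):
        return prompt + "\nstep 3: run the health check before restarting."
    return prompt


def converged(signals: list[RunSignals]) -> bool:
    no_confirmed = all(not s.unclear_points for s in signals)
    pass_ok = min(s.checklist_pass_rate for s in signals) >= 0.9
    return no_confirmed and pass_ok


def tune(prompt: str, scenarios: list[str],
         runners: int = 3, max_iters: int = 5) -> str:
    """Dispatch blind runners, harvest signals, refine until convergence."""
    for _ in range(max_iters):
        signals = [run_blind(prompt, sc) for sc in scenarios
                   for _ in range(runners)]
        if converged(signals):
            break
        prompt = refine_prompt(prompt, signals)
    return prompt


tuned = tune("restart the service when it is unhealthy.", ["scenario-A"])
```

In this toy run the first iteration flags the missing "step 3" instruction, the refinement adds it, and the second iteration converges.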
## What

Adds a new skill, `empirical-tuning`, to the plugin's `skills/` directory, plus a brief `CHANGELOG.md` entry. It implements a pre-crystallize hardening loop.
## Why

`scoring` (and the crystallize gate it feeds) is a post-hoc self-run measurement: the skill's author is also the skill's executor, which tends to hide the author's implicit assumptions. The result is skills that look clean on paper, score well, and then behave inconsistently when a cold-start agent (or another subagent) picks them up. `empirical-tuning` adds the missing pre-scoring step: a blind-runner loop that structurally exposes ambiguity in the skill's instructions before crystallize locks the version.

This is complementary to, not a replacement for, `scoring`.

## How
### Origin and adaptation

The skill is adapted from mizchi's `empirical-prompt-tuning` (attributed in the frontmatter and body). Credit for the workflow and dispatch pattern is mizchi's.

### CCC-edition extension
One substantive extension over the upstream source: an explicit **Qualitative convergence criteria** section under Step 7, addressing a reproducibility question that came up immediately when the skill was dogfooded.
Problem: blind-runners are non-deterministic. Running scenario A twice with different subagents can produce precision scores that drift ±15pt even though the skill itself hasn't changed. Using quantitative thresholds alone (mizchi's workflow already warns about this) leaves the "stop iterating" decision vulnerable to that noise.
Resolution: separate the two axes.
- **Location axis**: a section flagged by two or more independent evaluators is `confirmed` and blocks convergence. A single-evaluator flag is a `candidate` and is allowed to persist.
- **Severity axis**: score variance within ±15pt on the same location is accepted as evaluator noise rather than a regression.

Both axes AND with the existing quantitative Step-7 gates: convergence needs both sides.
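The two-axis rule can be sketched as a small classifier. Only the ±15pt noise band and the confirmed/candidate distinction come from this PR; function names and the gate wiring are hypothetical:

```python
# Hypothetical sketch of the two-axis qualitative convergence check.
from collections import Counter

SEVERITY_NOISE_BAND = 15  # ±15pt spread accepted as evaluator noise


def classify_flags(flags: list[str]) -> tuple[set[str], set[str]]:
    """Split flagged locations into confirmed (>=2 evaluators) and
    candidate (single evaluator) ambiguities."""
    counts = Counter(flags)
    confirmed = {loc for loc, n in counts.items() if n >= 2}
    candidates = {loc for loc, n in counts.items() if n == 1}
    return confirmed, candidates


def severity_is_noise(scores: list[float]) -> bool:
    """Score spread inside the noise band is treated as evaluator noise."""
    return max(scores) - min(scores) <= SEVERITY_NOISE_BAND


def qualitative_converged(flags: list[str], scores: list[float]) -> bool:
    confirmed, _ = classify_flags(flags)
    # Confirmed ambiguities block convergence; candidates may persist.
    return not confirmed and severity_is_noise(scores)


def step7_converged(flags: list[str], scores: list[float],
                    quantitative_gates_pass: bool) -> bool:
    # Both axes AND with the quantitative Step-7 gates.
    return qualitative_converged(flags, scores) and quantitative_gates_pass
```

For example, two evaluators flagging `req-3` yields a confirmed ambiguity that blocks convergence regardless of how the scores drift, while a lone `intro` flag with a 10pt spread does not.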
### Dogfood evidence

Dogfooded over 3 iterations on a real operational skill (`vps-monitor`, an infra-monitor skill in a private CCC deployment):

- `[critical]` requirement; precision for the targeted scenario: 30% → 40%

Without the location/severity separation, scenario A's 80 → 60 precision drift between iter 1 and iter 2 would have looked like a regression. With it, both runs point at the same underlying ambiguity and the drift is correctly recognised as severity noise.
### What this PR does and doesn't touch

Touches:

- `skills/empirical-tuning/SKILL.md`
- `[Unreleased]` entry in `CHANGELOG.md`

Doesn't touch:

- `scoring`, `crystallizing`, or the registry schema

A follow-up PR is planned to wire `empirical_validated: true` into the `crystallizing` gate (see "Follow-up" below). That's deliberately split out so this PR can be evaluated on its own merits.

### Follow-up
If this is merged, a second PR will propose gating `crystallize` on `empirical_validated`, with a grandfather policy of option C: enforce at `crystallize` time (not at skill-creation or review time), so existing skills are not disturbed unless/until they graduate to `crystallized`. Rationale: `crystallized` = production-immutable, which is the right threshold for requiring structural ambiguity checks. Alternatives considered (A: new-skills-only, B: retroactive re-validation of everything) are documented in the planned follow-up's description.
### Checklist

- Skill frontmatter follows conventions (`name` kebab-case, `description` starting with "Use when…", under 500 chars)
- Dogfooded on `vps-monitor` for 3 iterations
- Commit message: `feat: add empirical-tuning skill for pre-crystallize hardening`
- `CHANGELOG.md` updated under `[Unreleased]`

Happy to iterate on the description, the location/severity phrasing, or the Follow-up gate design based on maintainer preference.