feat: add empirical-tuning skill for pre-crystallize hardening#4

Open
Ageoh wants to merge 1 commit into Shmayro:main from Ageoh:feat/empirical-tuning-skill

Conversation


Ageoh commented Apr 21, 2026

What

Adds a new skill, empirical-tuning, to the plugin's skills/ directory, plus a brief CHANGELOG.md entry.

It implements a pre-crystallize hardening loop:

  1. Dispatch independent blind-executor subagents that read the target skill's body for the first time
  2. Have them execute a fixed scenario and return a structured report (unclear points, discretion filled in, checklist pass rate, step count, duration, retries)
  3. Apply a single-theme minimal edit per iteration
  4. Repeat with fresh subagents until a convergence gate is cleared
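The loop above reads roughly like this sketch. All names are hypothetical and the blind executor is stubbed with a toy function; in the actual skill, a fresh subagent is dispatched where `fake_blind_run` sits:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class BlindReport:
    """Structured report from one blind-executor run (subset of the signals)."""
    unclear_points: list        # sections the cold reader flagged as ambiguous
    checklist_pass_rate: float  # 0..100


def tuning_loop(skill_body, run_blind, edit_skill, gate, n_runners=2, max_iters=5):
    """One iteration = fresh blind runs -> convergence gate -> single minimal edit."""
    for iteration in range(1, max_iters + 1):
        reports = [run_blind(skill_body) for _ in range(n_runners)]
        if gate(reports):
            return skill_body, iteration
        skill_body = edit_skill(skill_body, reports)  # one theme per iteration
    return skill_body, max_iters


# --- toy stand-ins for subagent dispatch (illustration only) ---
def fake_blind_run(body):
    flags = ["Step 3"] if "be thorough" in body else []
    return BlindReport(flags, 70.0 if flags else 95.0)


def fake_edit(body, reports):
    return body.replace("be thorough", "check every item on the list")


def gate(reports):
    # Confirmed ambiguity = same location flagged by >= 2 independent runners.
    confirmed = [loc for loc, n in Counter(
        p for r in reports for p in set(r.unclear_points)).items() if n >= 2]
    return not confirmed and min(r.checklist_pass_rate for r in reports) >= 90


final_body, iterations = tuning_loop("Step 3: be thorough", fake_blind_run, fake_edit, gate)
```

Here the vague instruction ("be thorough") is flagged by both runners in iteration 1, a single-theme edit replaces it, and the fresh runners in iteration 2 clear the gate.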

Why

`scoring` (and the crystallize gate it feeds) is a post-hoc self-run measurement: the skill's author is also the skill's executor, which tends to hide the author's implicit assumptions. The result is skills that look clean on paper, score well, and then behave inconsistently when a cold-start agent (or another subagent) picks them up.

empirical-tuning adds the missing pre-scoring step: a blind-runner loop that structurally exposes ambiguity in the skill's instructions before crystallize locks the version.

This is complementary to, not a replacement for, scoring:

| Phase | Tool | Signal |
| --- | --- | --- |
| Pre-scoring | `empirical-tuning` (this PR) | Whether the instructions are robust to a cold reader |
| Post-hoc | `scoring` | How well the skill performs when the author runs it |

How

Origin and adaptation

The skill is adapted from mizchi's empirical-prompt-tuning (attributed in the frontmatter and body). Credit for the workflow and dispatch pattern is mizchi's.

CCC-edition extension

One substantive extension over the upstream source: an explicit Qualitative convergence criteria section under Step 7, addressing a reproducibility question that came up immediately when the skill was dogfooded.

Problem: blind-runners are non-deterministic. Running scenario A twice with different subagents can produce precision scores that drift ±15pt even though the skill itself hasn't changed. Using quantitative thresholds alone (mizchi's workflow already warns about this) leaves the "stop iterating" decision vulnerable to that noise.

Resolution: separate the two axes.

  • Location convergence (primary signal) — if two or more independent evaluators flag the same section / line / requirement number as unclear, that ambiguity is confirmed and blocks convergence. A single-evaluator flag is a candidate and is allowed to persist.
  • Severity variance (noise budget) — disagreement on ○ (pass) / partial / × (fail) for the same location within ±15pt is accepted as evaluator subjectivity. Variance > 15pt is treated as a requirement-wording problem (the requirement, not the skill, needs clarification).

Both axes are ANDed with the existing quantitative Step-7 gates: convergence requires both the qualitative and quantitative sides to pass.
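A minimal sketch of the two-axis qualitative gate, assuming a hypothetical report shape (each evaluator returns a set of unclear locations plus per-location precision scores; the quantitative Step-7 gates are not modelled here):

```python
from collections import Counter


def qualitative_gate(reports, noise_budget=15):
    """Two-axis qualitative convergence check.

    reports: one dict per independent evaluator (hypothetical shape):
        {"unclear": set of locations, "scores": {location: precision 0..100}}

    Location convergence: a location flagged unclear by >= 2 evaluators is a
    confirmed ambiguity and blocks convergence; single flags may persist.
    Severity variance: a per-location score spread within the noise budget is
    accepted as evaluator subjectivity; a larger spread marks the requirement
    wording itself (not the skill) as needing clarification.
    """
    flag_counts = Counter(loc for r in reports for loc in r["unclear"])
    confirmed = sorted(loc for loc, n in flag_counts.items() if n >= 2)

    per_loc = {}
    for r in reports:
        for loc, score in r["scores"].items():
            per_loc.setdefault(loc, []).append(score)
    reword = sorted(loc for loc, ss in per_loc.items()
                    if max(ss) - min(ss) > noise_budget)

    converged = not confirmed and not reword
    return converged, confirmed, reword
```

In actual use the boolean would be ANDed with the quantitative gates before declaring convergence.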

Dogfood evidence

Dogfooded over 3 iterations on a real operational skill (vps-monitor, an infra-monitor skill in a private CCC deployment):

Without the location/severity separation, scenario A's 80 → 60 precision drift between iter 1 and iter 2 would have looked like a regression. With it, both runs point at the same underlying ambiguity and the drift is correctly recognised as severity noise.

What this PR does and doesn't touch

  • Adds skills/empirical-tuning/SKILL.md
  • Adds an [Unreleased] entry in CHANGELOG.md
  • Does not modify scoring, crystallizing, or the registry schema
  • Does not mandate empirical validation for anything — the skill is opt-in

A follow-up PR is planned to wire empirical_validated: true into the crystallizing gate (see "Follow-up" below). That's deliberately split out so this PR can be evaluated on its own merits.

Follow-up

If this is merged, a second PR will propose gating crystallize on empirical_validated, with a grandfather policy of option C: enforce at crystallize time (not at skill-creation or review time), so existing skills are not disturbed unless/until they graduate to crystallized. Rationale: crystallized = production-immutable, which is the right threshold for requiring structural ambiguity checks. Alternatives considered (A: new-skills-only, B: retroactive re-validation of everything) are documented in the planned follow-up's description.

Checklist

  • SKILL.md has valid YAML frontmatter (name kebab-case, description starting with "Use when…", under 500 chars)
  • No scripts added in this PR (ShellCheck N/A)
  • Tested locally by installing into a Claude Code session and dogfooding on vps-monitor for 3 iterations
  • Conventional Commit: feat: add empirical-tuning skill for pre-crystallize hardening
  • CHANGELOG.md updated under [Unreleased]
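The frontmatter items in the checklist can be machine-checked. A rough sketch, assuming a simple `---`-delimited block with flat `key: value` lines (no nested YAML):

```python
import re


def check_frontmatter(text):
    """Check the three frontmatter rules from the checklist:
    kebab-case name, description starting 'Use when', description under 500 chars.
    Assumes SKILL.md opens with a ----delimited block of flat key: value lines.
    """
    m = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    if not m:
        return ["missing frontmatter"]
    fields = dict(
        line.split(":", 1) for line in m.group(1).splitlines() if ":" in line
    )
    fields = {k.strip(): v.strip() for k, v in fields.items()}

    problems = []
    name = fields.get("name", "")
    desc = fields.get("description", "")
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append("name not kebab-case")
    if not desc.startswith("Use when"):
        problems.append("description must start with 'Use when'")
    if len(desc) >= 500:
        problems.append("description 500+ chars")
    return problems
```

A full validator would want a real YAML parser; this only covers the three rules the checklist names.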

Happy to iterate on the description, the location/severity phrasing, or the Follow-up gate design based on maintainer preference.

Introduces `empirical-tuning` — a blind-runner loop that dispatches
independent subagents to execute a target skill, harvests qualitative
(unclear points, discretion filled in) + quantitative (checklist
pass rate, step count, duration) signals, and refines the skill's
prompt across iterations until it converges.

The skill is adapted from mizchi's empirical-prompt-tuning
(https://github.com/mizchi/chezmoi-dotfiles/blob/main/dot_claude/skills/empirical-prompt-tuning/SKILL.md)
with the following CCC-edition extension:

- Explicit **Qualitative convergence criteria** — location
  convergence (two or more evaluators flagging the same section /
  requirement number) is treated as the primary convergence signal
  (confirmed ambiguity), while severity variance within +/-15pt is
  accepted as evaluator noise. Makes the "when to stop iterating"
  decision reproducible across blind-runner runs.
- Integration notes positioning empirical-tuning as a pre-`scoring`
  harness complementing singularity-claude's self-run scoring loop.

Dogfooded on a real skill (vps-monitor) for 3 iterations; the
location-convergence criterion identified two confirmed ambiguities
that self-run scoring had not surfaced.
