feat: add empirical-tuning skill for pre-crystallize hardening#4

Open
Ageoh wants to merge 1 commit into Shmayro:main from Ageoh:feat/empirical-tuning-skill

Conversation


Ageoh commented Apr 21, 2026

What

Adds a new skill, empirical-tuning, to the plugin's skills/ directory, plus a brief CHANGELOG.md entry.

It implements a pre-crystallize hardening loop:

  1. Dispatch independent blind-executor subagents that read the target skill's body for the first time
  2. Have them execute a fixed scenario and return a structured report (unclear points, discretion filled in, checklist pass rate, step count, duration, retries)
  3. Apply a single-theme minimal edit per iteration
  4. Repeat with fresh subagents until a convergence gate is cleared
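The loop above reads roughly like this sketch. All names are hypothetical and the blind executor is stubbed with a toy function; in the actual skill, a fresh subagent is dispatched where `fake_blind_run` sits:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class BlindReport:
    """Structured report from one blind-executor run (subset of the signals)."""
    unclear_points: list        # sections the cold reader flagged as ambiguous
    checklist_pass_rate: float  # 0..100


def tuning_loop(skill_body, run_blind, edit_skill, gate, n_runners=2, max_iters=5):
    """One iteration = fresh blind runs -> convergence gate -> single minimal edit."""
    for iteration in range(1, max_iters + 1):
        reports = [run_blind(skill_body) for _ in range(n_runners)]
        if gate(reports):
            return skill_body, iteration
        skill_body = edit_skill(skill_body, reports)  # one theme per iteration
    return skill_body, max_iters


# --- toy stand-ins for subagent dispatch (illustration only) ---
def fake_blind_run(body):
    flags = ["Step 3"] if "be thorough" in body else []
    return BlindReport(flags, 70.0 if flags else 95.0)


def fake_edit(body, reports):
    return body.replace("be thorough", "check every item on the list")


def gate(reports):
    # Confirmed ambiguity = same location flagged by >= 2 independent runners.
    confirmed = [loc for loc, n in Counter(
        p for r in reports for p in set(r.unclear_points)).items() if n >= 2]
    return not confirmed and min(r.checklist_pass_rate for r in reports) >= 90


final_body, iterations = tuning_loop("Step 3: be thorough", fake_blind_run, fake_edit, gate)
```

Here the vague instruction ("be thorough") is flagged by both runners in iteration 1, a single-theme edit replaces it, and the fresh runners in iteration 2 clear the gate.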

Why

`scoring` (and the crystallize gate it feeds) is a post-hoc self-run measurement: the skill's author is also the skill's executor, which tends to hide the author's implicit assumptions. The result is skills that look clean on paper, score well, and then behave inconsistently when a cold-start agent (or another subagent) picks them up.

empirical-tuning adds the missing pre-scoring step: a blind-runner loop that structurally exposes ambiguity in the skill's instructions before crystallize locks the version.

This is complementary to, not a replacement for, scoring:

| Phase | Tool | Signal |
| --- | --- | --- |
| Pre-scoring | `empirical-tuning` (this PR) | Whether the instructions are robust to a cold reader |
| Post-hoc | `scoring` | How well the skill performs when the author runs it |

How

Origin and adaptation

The skill is adapted from mizchi's empirical-prompt-tuning (attributed in the frontmatter and body). Credit for the workflow and dispatch pattern is mizchi's.

CCC-edition extension

One substantive extension over the upstream source: an explicit Qualitative convergence criteria section under Step 7, addressing a reproducibility question that came up immediately when the skill was dogfooded.

Problem: blind-runners are non-deterministic. Running scenario A twice with different subagents can produce precision scores that drift ±15pt even though the skill itself hasn't changed. Using quantitative thresholds alone (mizchi's workflow already warns about this) leaves the "stop iterating" decision vulnerable to that noise.

Resolution: separate the two axes.

  • Location convergence (primary signal) — if two or more independent evaluators flag the same section / line / requirement number as unclear, that ambiguity is confirmed and blocks convergence. A single-evaluator flag is a candidate and is allowed to persist.
  • Severity variance (noise budget) — disagreement on ○ (pass) / partial / × (fail) for the same location within ±15pt is accepted as evaluator subjectivity. Variance > 15pt is treated as a requirement-wording problem (the requirement, not the skill, needs clarification).

Both axes are ANDed with the existing quantitative Step-7 gates: convergence requires both the qualitative and quantitative sides to pass.
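A minimal sketch of the two-axis qualitative gate, assuming a hypothetical report shape (each evaluator returns a set of unclear locations plus per-location precision scores; the quantitative Step-7 gates are not modelled here):

```python
from collections import Counter


def qualitative_gate(reports, noise_budget=15):
    """Two-axis qualitative convergence check.

    reports: one dict per independent evaluator (hypothetical shape):
        {"unclear": set of locations, "scores": {location: precision 0..100}}

    Location convergence: a location flagged unclear by >= 2 evaluators is a
    confirmed ambiguity and blocks convergence; single flags may persist.
    Severity variance: a per-location score spread within the noise budget is
    accepted as evaluator subjectivity; a larger spread marks the requirement
    wording itself (not the skill) as needing clarification.
    """
    flag_counts = Counter(loc for r in reports for loc in r["unclear"])
    confirmed = sorted(loc for loc, n in flag_counts.items() if n >= 2)

    per_loc = {}
    for r in reports:
        for loc, score in r["scores"].items():
            per_loc.setdefault(loc, []).append(score)
    reword = sorted(loc for loc, ss in per_loc.items()
                    if max(ss) - min(ss) > noise_budget)

    converged = not confirmed and not reword
    return converged, confirmed, reword
```

In actual use the boolean would be ANDed with the quantitative gates before declaring convergence.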

Dogfood evidence

Dogfooded over 3 iterations on a real operational skill (vps-monitor, an infra-monitor skill in a private CCC deployment):

Without the location/severity separation, scenario A's 80 → 60 precision drift between iter 1 and iter 2 would have looked like a regression. With it, both runs point at the same underlying ambiguity and the drift is correctly recognised as severity noise.

What this PR does and doesn't touch

  • Adds skills/empirical-tuning/SKILL.md
  • Adds an [Unreleased] entry in CHANGELOG.md
  • Does not modify scoring, crystallizing, or the registry schema
  • Does not mandate empirical validation for anything — the skill is opt-in

A follow-up PR is planned to wire empirical_validated: true into the crystallizing gate (see "Follow-up" below). That's deliberately split out so this PR can be evaluated on its own merits.

Follow-up

If this is merged, a second PR will propose gating crystallize on empirical_validated, with a grandfather policy of option C: enforce at crystallize time (not at skill-creation or review time), so existing skills are not disturbed unless/until they graduate to crystallized. Rationale: crystallized = production-immutable, which is the right threshold for requiring structural ambiguity checks. Alternatives considered (A: new-skills-only, B: retroactive re-validation of everything) are documented in the planned follow-up's description.

Checklist

  • SKILL.md has valid YAML frontmatter (name kebab-case, description starting with "Use when…", under 500 chars)
  • No scripts added in this PR (ShellCheck N/A)
  • Tested locally by installing into a Claude Code session and dogfooding on vps-monitor for 3 iterations
  • Conventional Commit: feat: add empirical-tuning skill for pre-crystallize hardening
  • CHANGELOG.md updated under [Unreleased]
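The frontmatter items in the checklist can be machine-checked. A rough sketch, assuming a simple `---`-delimited block with flat `key: value` lines (no nested YAML):

```python
import re


def check_frontmatter(text):
    """Check the three frontmatter rules from the checklist:
    kebab-case name, description starting 'Use when', description under 500 chars.
    Assumes SKILL.md opens with a ----delimited block of flat key: value lines.
    """
    m = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    if not m:
        return ["missing frontmatter"]
    fields = dict(
        line.split(":", 1) for line in m.group(1).splitlines() if ":" in line
    )
    fields = {k.strip(): v.strip() for k, v in fields.items()}

    problems = []
    name = fields.get("name", "")
    desc = fields.get("description", "")
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append("name not kebab-case")
    if not desc.startswith("Use when"):
        problems.append("description must start with 'Use when'")
    if len(desc) >= 500:
        problems.append("description 500+ chars")
    return problems
```

A full validator would want a real YAML parser; this only covers the three rules the checklist names.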

Happy to iterate on the description, the location/severity phrasing, or the Follow-up gate design based on maintainer preference.

Introduces `empirical-tuning` — a blind-runner loop that dispatches
independent subagents to execute a target skill, harvests qualitative
(unclear points, discretion filled in) + quantitative (checklist
pass rate, step count, duration) signals, and refines the skill's
prompt across iterations until it converges.

The skill is adapted from mizchi's empirical-prompt-tuning
(https://github.com/mizchi/chezmoi-dotfiles/blob/main/dot_claude/skills/empirical-prompt-tuning/SKILL.md)
with the following CCC-edition extension:

- Explicit **Qualitative convergence criteria** — location
  convergence (two or more evaluators flagging the same section /
  requirement number) is treated as the primary convergence signal
  (confirmed ambiguity), while severity variance within +/-15pt is
  accepted as evaluator noise. Makes the "when to stop iterating"
  decision reproducible across blind-runner runs.
- Integration notes positioning empirical-tuning as a pre-`scoring`
  harness complementing singularity-claude's self-run scoring loop.

Dogfooded on a real skill (vps-monitor) for 3 iterations; the
location-convergence criterion identified two confirmed ambiguities
that self-run scoring had not surfaced.
