Cerberus

Reference implementation for the blog post "Stop Testing AI Agents Like Deterministic Code."

Statistical CI/CD for AI agents — using sequential hypothesis testing (SPRT), Wilson score confidence intervals, and Benjamini-Hochberg correction to answer "does this agent work reliably enough?" instead of "did this test pass?"

Quick start

npm install
npm run build
npx cerberus init        # scaffold a starter config
npx cerberus run         # run the test suite

What this demonstrates

Cerberus models an agent as a Bernoulli process: each invocation passes or fails with some unknown probability p. It runs trials, accumulates statistical evidence, and makes a rigorous pass/fail/inconclusive decision, stopping early when the evidence is clear.
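The early-stopping idea can be illustrated with a minimal sketch of Wald's SPRT for pass/fail outcomes. This is not the code in `src/stats.ts`; the names `sprtDecision` and `SprtParams`, and the choice of the two hypothesized pass rates `p0` and `p1`, are assumptions made for illustration.

```typescript
type Verdict = "accept" | "reject" | "continue";

// Hypothetical parameter shape; Cerberus may derive these differently from its config.
interface SprtParams {
  p0: number;    // pass rate considered unacceptable (H0)
  p1: number;    // pass rate considered acceptable (H1), with p1 > p0
  alpha: number; // tolerated probability of accepting an unreliable agent
  beta: number;  // tolerated probability of rejecting a reliable agent
}

function sprtDecision(
  outcomes: boolean[],
  params: SprtParams,
): { verdict: Verdict; trialsUsed: number; llr: number } {
  const { p0, p1, alpha, beta } = params;

  // Wald boundaries on the log-likelihood ratio.
  const upper = Math.log((1 - beta) / alpha);
  const lower = Math.log(beta / (1 - alpha));

  // Each pass or fail moves the log-likelihood ratio by a fixed step.
  const passStep = Math.log(p1 / p0);
  const failStep = Math.log((1 - p1) / (1 - p0));

  let llr = 0;
  for (let i = 0; i < outcomes.length; i++) {
    llr += outcomes[i] ? passStep : failStep;
    if (llr >= upper) return { verdict: "accept", trialsUsed: i + 1, llr };
    if (llr <= lower) return { verdict: "reject", trialsUsed: i + 1, llr };
  }
  // Neither boundary crossed: more trials are needed.
  return { verdict: "continue", trialsUsed: outcomes.length, llr };
}

const params: SprtParams = { p0: 0.8, p1: 0.95, alpha: 0.05, beta: 0.05 };
const allPass = sprtDecision(new Array<boolean>(50).fill(true), params);
console.log(allPass.verdict, allPass.trialsUsed);
```

With these illustrative parameters, an unbroken run of passes crosses the upper boundary after 18 trials, so the remaining 32 of the 50 budgeted trials are never needed.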

Config → define contracts (assertions) with reliability thresholds:

adapter:
  command: "node ./my-agent.js --scenario {{scenario}}"
  timeout: 30000

studies:
  - name: code-review-quality
    scenario: scenarios/review-task.yaml
    contracts:
      - name: exits-cleanly
        type: code
        assert: "output.meta.exitCode === 0"
        threshold: 0.95
        confidence: 0.95
        trials: 50
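To show how an `assert` string such as the one above might be evaluated, here is a minimal sketch built on Node's `vm.runInNewContext()`, which the file map names as the mechanism. The helper name `evaluateAssertion`, the default timeout, and the policy of treating a thrown error as a failed assertion are assumptions, not the behavior of `src/contracts.ts`.

```typescript
import { runInNewContext } from "node:vm";

// Hypothetical helper: evaluates an assertion expression with `output` as the
// only variable in scope. Returns true only if the expression yields exactly true.
function evaluateAssertion(
  assertion: string,
  output: unknown,
  timeoutMs = 1000, // assumed default, guards against runaway expressions
): boolean {
  try {
    const result: unknown = runInNewContext(assertion, { output }, { timeout: timeoutMs });
    return result === true;
  } catch {
    // Assumption: an expression that throws (for example, reading a property
    // of undefined) counts as a failed assertion rather than a crash.
    return false;
  }
}

console.log(evaluateAssertion("output.meta.exitCode === 0", { meta: { exitCode: 0 } }));
```

Note that Node's documentation states the `vm` module is not a security mechanism, so a sketch like this limits what an assertion can see but should not be relied on to contain untrusted code.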

Output → pass rates with confidence intervals and early stopping:

Contracts:
  exits-cleanly       PASS  100.0% [CI: 89–100%]  (15 trials, early stop)
  produces-valid-json  PASS   95.0% [CI: 78–99%]   (22 trials, early stop)

Suite: PASS (2/2 contracts satisfied)
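For readers unfamiliar with the interval behind the `[CI: …]` column, the following is a minimal sketch of the standard Wilson score interval. The function name `wilsonInterval` and the fixed default `z = 1.96` (roughly 95% confidence) are assumptions; the implementation in `src/stats.ts` may differ.

```typescript
// Wilson score interval for a binomial proportion.
// Unlike the naive normal approximation, it stays informative at 0% and 100%.
function wilsonInterval(
  passes: number,
  trials: number,
  z = 1.96, // assumed default: two-sided 95% confidence
): { lower: number; upper: number } {
  if (trials <= 0) throw new RangeError("trials must be positive");

  const pHat = passes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (pHat + z2 / (2 * trials)) / denom;
  const half =
    (z / denom) * Math.sqrt((pHat * (1 - pHat)) / trials + z2 / (4 * trials * trials));

  // Clamp to [0, 1] to absorb floating-point drift at the extremes.
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

console.log(wilsonInterval(10, 10)); // all passes: lower bound is well below 1
console.log(wilsonInterval(0, 10));  // no passes: upper bound is well above 0
```

With 10 passes out of 10, the lower bound is about 0.72 rather than 1, which is why a short perfect run does not by itself establish a high reliability threshold.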

Exit codes → 0 pass, 1 fail, 3 inconclusive — designed for CI pipelines.
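One way the three-valued result could map onto process exit codes is sketched below. The codes 0, 1, and 3 come from the line above; the aggregation rule (any failed contract fails the suite, otherwise any inconclusive contract makes the suite inconclusive) and the names `SuiteVerdict`, `EXIT_CODES`, and `suiteVerdict` are assumptions for illustration.

```typescript
type SuiteVerdict = "pass" | "fail" | "inconclusive";

// Exit codes as documented: 0 pass, 1 fail, 3 inconclusive.
const EXIT_CODES: Record<SuiteVerdict, number> = { pass: 0, fail: 1, inconclusive: 3 };

// Assumed aggregation rule: fail dominates, then inconclusive, then pass.
function suiteVerdict(contracts: SuiteVerdict[]): SuiteVerdict {
  if (contracts.some((v) => v === "fail")) return "fail";
  if (contracts.some((v) => v === "inconclusive")) return "inconclusive";
  return "pass";
}

const verdict = suiteVerdict(["pass", "inconclusive"]);
console.log(verdict, EXIT_CODES[verdict]);
// In a CLI entry point, the code would typically be assigned to process.exitCode.
```

Keeping inconclusive distinct from fail lets a pipeline choose its own policy, for example retrying with a larger trial budget instead of blocking a merge.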

File map

Which file implements which concept from the blog post:

Concept	File	What it does
SPRT (sequential testing)	`src/stats.ts`	Log-likelihood ratio, Wald boundaries, accept/reject/continue
Wilson score intervals	`src/stats.ts`	Confidence intervals that work at 0% and 100%
Benjamini-Hochberg	`src/stats.ts`	Multiple testing correction (FDR control)
Contract evaluation	`src/contracts.ts`	Sandboxed assertion execution via `vm.runInNewContext()`
LLM judge panels	`src/judges.ts`	Multi-provider judge evaluation with majority vote
Trial execution	`src/runner.ts`	SPRT loop, process spawning, error rate monitoring
YAML config + validation	`src/config.ts`	Zod schema, threshold/confidence validation
Result formatting	`src/output.ts`	Console output, JSON serialization, result persistence
Domain types	`src/types.ts`	`SPRTState`, `ConfidenceInterval`, `ContractResult`, exit codes
CLI entry point	`src/cli.ts`	`run` and `init` commands, exit code mapping
Error types	`src/errors.ts`	`CerberusError`, `ConfigError` with typed exit codes

9 source files, ~1,600 lines. 80 tests in tests/.
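The Benjamini-Hochberg row in the file map refers to a standard step-up procedure that can be sketched in a few lines. This is an illustrative version, not the code in `src/stats.ts`; the function name `benjaminiHochberg`, the default `q = 0.05`, and the assumption that each contract yields one p-value are all hypothetical.

```typescript
// Benjamini-Hochberg step-up procedure.
// Returns, for each input p-value (in the original order), whether its
// hypothesis is rejected while controlling the false discovery rate at q.
function benjaminiHochberg(pValues: number[], q = 0.05): boolean[] {
  const m = pValues.length;

  // Sort p-values ascending while remembering their original positions.
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);

  // Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
  let cutoff = -1;
  order.forEach((entry, rank) => {
    if (entry.p <= ((rank + 1) / m) * q) cutoff = rank;
  });

  // Reject every hypothesis ranked at or below the cutoff.
  const rejected = new Array<boolean>(m).fill(false);
  order.slice(0, cutoff + 1).forEach((entry) => {
    rejected[entry.index] = true;
  });
  return rejected;
}

console.log(benjaminiHochberg([0.5, 0.03, 0.013, 0.02]));
```

In this example the smallest p-value (0.013) misses its own threshold of 0.0125 but is still rejected, because the procedure rejects everything ranked below the largest p-value that meets its threshold.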

Commands

npm run build      # Build with tsup
npm test           # Run tests with vitest
npm run typecheck  # Type-check with tsc
npm run dev        # Run CLI in dev mode with tsx

License

MIT
