[Draft] PRD: CVS DTNI suite expansion#189
atnair-amd wants to merge 7 commits into
Architecture-only PRD for the inference + training suite refactor: unified WorkloadAdapter lifecycle, Strategy + Composite topology, RAII resource handles, persistent manifest, typed config schema, and the temporal threshold predicate language. Each section labels what lands now vs what the architecture admits as a future seam. Companion design doc lives separately and is referenced by name in the PRD header.
Concretizes the Strategy + Composite section (§3.2) with the sglang_disagg_factory: SglangServerAdapter reused twice (prefill, decode), router and bench wired into a dag-launch CompositeAdapter, composite-level RouterBalancePredicate, and PdHandoffJoinParser for the cross-role samples carrier. Replaces the 1200-LOC sglang_disagg_lib.py monolith.
Concretizes the Lifecycle / Template Method section (§3.1):
- Full typed WorkloadAdapter Protocol with type signatures for each of the six lifecycle methods plus progress_predicate.
- Job driver with explicit failure-category classification at each raise boundary (setup / safety / liveness / pattern-matched), RAII teardown in finally, and the _await_with_progress polling loop that distinguishes safety_violation from liveness_failure.
- Three-line driver-level dispatch showing the registry lookup.
The original architecture PRD (cvs-dtni-suite-expansion-prd.md) was discussed in review and superseded by a scoped v1 implementation spec:
- docs/prd/cvs-dtni-v1-spec.md: eight workstreams that replace today's DTNI stack (lifecycle + adapters + typed configs + manifest + binder + pytest layer + security/correctness fixes + tooling).
The original PRD is preserved in git history (commits 625c8f2, f6c84ba, 0d0dd1a on this branch). Anything from it that is still load-bearing for v1 readers has been absorbed into the appropriate workstream description; the spec does not reference the deleted PRD.
The v1 spec (cvs-dtni-v1-spec.md) is dense prose; reviewer feedback asked for a more digestible PR-body version with snippets, mermaid diagrams, file outlines, and concrete walkthroughs. This new file becomes the PR body content; the spec stays as the long-form prose reference.
pr-body.md adds:
- end-to-end data flow mermaid diagram
- sglang before/after with real LOC numbers
- one config -> N pytest IDs walkthrough (cvs plan + pytest --collect-only)
- lifecycle + failure-classification mermaid diagram with Protocol code and Job.run() body
- before/after lib directory tree with LOC counts; pytest tier tree; class hierarchy mermaid
- marker derivation table
- cluster + binder walkthrough with the insufficient-nodes skip case
- sample manifest.json (~50 lines, real values), samples/trajectory parquet column schemas, events vocabulary table
- pandas snippet for cross-run P99 TTFT regression by git SHA
- three sweep walkthroughs (cartesian, paired-topology, constraint-validated)
- six Threshold predicates as YAML
- end-to-end safety_violation failure walkthrough
- W7 before/after table for the 8 security/correctness fixes
- workstream DAG mermaid
Anchor links throughout point back into v1-spec.md for full prose.
Replaces the prior "sledgehammer before/after" framing with a forward-looking overview of what the redesign enables, organized into seven sections per reviewer guidance:
1. Code structure and design philosophy (uniform lifecycle + factory/registry + state-on-disk; flow diagram; Protocol + class hierarchy; "framework emits, CVS retains" + "one workload run, many sliceable claims" principles)
2. Tiered tests (the six tiers with examples; why tier structure makes the matrix queryable)
3. Pytest invocation and lifecycle (the matrix story end-to-end with the lifecycle mermaid, marker derivation table, sample test ID showing every axis, three CLI slicing queries, workload_run fixture sketch)
4. Config files - categories of dials (annotated YAML with sections; one-paragraph description of each dial category: identity, workload, model, knobs, params, sweep, benchmarks, thresholds, topology, secrets)
5. Sweeps (three semantics with examples; how cells propagate into pytest parametrize IDs; cvs plan dry-run)
6. Metrics and benchmarks supported per framework (two matrix tables: captured metrics per framework, benchmarks per framework; six Threshold predicates as YAML; how new metrics/benchmarks/predicates are added cheaply)
7. Manifest and sidecars (directory layout; sample manifest.json; samples/trajectory schemas; events vocabulary; cross-run pandas regression query; why the design choices enable cheap re-verify, re-parse, and dashboard consumption)
Dropped: sglang before/after hook, repo file-tree before/after, dedicated cluster/binder section, standalone failure walkthrough, workstream DAG. W7 security fixes moved to appendix.
Adds prose throughout to frame the snippets and diagrams; the goal of the overview is to show philosophy and capabilities, not enumerate line counts.
The pr-body.md adapter-contract block described a threaded AdapterRun object with rich typed params/returns (Context, WorkloadResult, Manifest, Threshold, Verdict) that never shipped. The real WorkloadAdapter Protocol threads a single ctx (RunContext) through all seven methods and returns None/List, with progress_predicate declared ahead of await_completion. Also document the multi-role launch/readiness plumbing (_launch_role, _wait_http_pool) that PR-A2 (#214) moved into BaseWorkloadAdapter, in both the base-adapter prose and the adapter-tree diagram, and align the W1 method list ordering in the spec.
CVS DTNI v1 — refactor overview
Status: Draft for team review. Full prose lives in `cvs-dtni-v1-spec.md`; this file is the PR body and the entry point.

This PR is a refactor of the DTNI (data-center training and inference) suite. It turns a fork-per-workload structure (one Python wrapper per `(framework, model, single/distributed)` tuple, one monolithic library per framework) into a uniform lifecycle-driven workload runner. The goal of this overview is to show what the redesign enables, not to enumerate every line that moves. Four capabilities are the headline:

- … `(framework, model)`, Pydantic-validated, fail-fast on typos.

Reading guide: seven sections, ~15 min end-to-end:

1. Code structure and design philosophy
2. Tiered tests
3. Pytest invocation and lifecycle
4. Config files
5. Sweeps
6. Metrics and benchmarks supported per framework
7. Manifest and sidecars

An appendix covers adoption, security/correctness fixes (W7), and the open reviewer decision.
§1. Code structure and design philosophy
The redesign rests on three primitives.
Three primitives
A uniform lifecycle. Every workload (training or inference, single-role or multi-role) executes the same six phases: `prepare → launch → await_completion → parse → verify → teardown`. The `Job` driver runs this lifecycle the same way for every workload; per-framework specialization lives entirely inside the adapter that implements those six methods.

A factory + registry. A typed config selects which adapter handles it. `INFERENCE_REGISTRY` and `TRAINING_REGISTRY` map `framework: vllm` (or `megatron`, `sglang_disagg`, etc.) to the concrete adapter class. The `Job` driver never branches on framework name; adding a new framework is one new adapter + one registry line.

State on disk, not in memory. Every cell of a run produces a content-addressable directory containing `manifest.json` + Parquet sidecars + raw logs. Tests, dashboards, and CI consumers all read from that directory. Nothing depends on module-level Python state surviving a process exit.

The flow

    flowchart LR
        cfg["Typed config<br/>(framework: vllm)"] --> reg["Registry<br/>(INFERENCE / TRAINING)"]
        reg --> adapter["Concrete adapter<br/>(VllmAdapter, MegatronAdapter, ...)"]
        adapter --> job["Job driver<br/>(6-step lifecycle)"]
        job --> dir["Per-run dir on disk<br/>(manifest + parquet + logs)"]
        dir --> tests["pytest functions<br/>(read manifest, fire assertions)"]
        dir --> exp["cvs export<br/>(flatten to fact.parquet)"]
        exp --> nb["pandas / DuckDB / dashboard"]

The arrows leaving `dir` indicate that the manifest tree is the single source of truth: pytest functions, cross-run exports, and any future dashboard all read the same artifacts.

The adapter contract

The factory hands a registered class to the `Job` driver; the `Job` driver calls seven methods: the six lifecycle methods in order, plus `progress_predicate`, which `await_completion` polls.

`BaseWorkloadAdapter` provides concrete defaults for `teardown` (always capture logs + dmesg + GPU state, then `docker rm` by label), `await_completion` (poll the predicate with timeout), and `prepare` (no-op). Most adapters override 3 of the 7 methods.

    flowchart TD
        proto["WorkloadAdapter (Protocol)"]
        proto --> base["BaseWorkloadAdapter<br/>(concrete defaults: teardown, await_completion, prepare)"]
        base --> vllm["VllmAdapter"]
        base --> imax["InferenceMaxAdapter"]
        base --> sgl["SglangDisaggAdapter"]
        base --> xdit["PytorchXditAdapter"]
        base --> meg["MegatronAdapter"]
        base --> jax["JaxAdapter"]

Design philosophy
Two principles drove the choices above.
Framework emits → CVS retains. The frameworks already emit rich data: per-request JSONL (vLLM), per-step trajectories (Megatron, JAX), Prometheus metrics (vLLM, sglang), per-batch JSON (InferenceMAX), per-step latency tracelogs (xDiT). Today CVS tail-greps the console and discards everything else. The redesign routes the framework's native emission directly into Parquet sidecars; nothing is invented and nothing is lost. Adding a metric is a `parse()` change, not a new pipeline.

One workload run → many sliceable claims. A single cell of a sweep produces one manifest. Many pytest test functions (logistics, framework-specific, benchmark-specific, model-specific) read that one manifest and assert independent properties. Failure of one assertion doesn't invalidate the run; the manifest is durable; rerunning a subset of claims against an existing manifest becomes one CLI flag (see §7).
The abstraction is intentionally shallow. If a hypothetical future workload needs to override all seven adapter methods, the abstraction has failed for that workload and the right move is to refactor at that point — not build for it speculatively now.
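The lifecycle, registry lookup, and failure classification described in this section can be sketched in a few dozen lines. This is a minimal illustration, not the shipped code: `RunContext` is reduced to a toy dataclass, `EchoAdapter` and `run_config` are hypothetical names, and the category table is a simplification of the boundaries shown in the lifecycle diagram (the real driver distinguishes safety from liveness inside its polling loop).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


@dataclass
class RunContext:
    """Toy stand-in for the ctx object threaded through every method."""
    config: dict
    events: List[str] = field(default_factory=list)
    failure: Optional[str] = None


class WorkloadAdapter(Protocol):
    def prepare(self, ctx: RunContext) -> None: ...
    def launch(self, ctx: RunContext) -> None: ...
    def progress_predicate(self, ctx: RunContext) -> bool: ...
    def await_completion(self, ctx: RunContext) -> None: ...
    def parse(self, ctx: RunContext) -> None: ...
    def verify(self, ctx: RunContext) -> List[dict]: ...
    def teardown(self, ctx: RunContext) -> None: ...


class BaseWorkloadAdapter:
    """Defaults for prepare, await_completion, and teardown."""

    def prepare(self, ctx):
        ctx.events.append("prepare")

    def progress_predicate(self, ctx):
        return True

    def await_completion(self, ctx):
        if not self.progress_predicate(ctx):
            raise RuntimeError("progress predicate broke")
        ctx.events.append("await_completion")

    def teardown(self, ctx):
        ctx.events.append("teardown")


class Job:
    PHASES = ("prepare", "launch", "await_completion", "parse")
    # Assumed mapping: which category is recorded when a phase raises.
    CATEGORY = {
        "prepare": "setup_failure",
        "launch": "setup_failure",
        "await_completion": "safety_violation",
        "parse": "failure_pattern_matched",
    }

    def __init__(self, adapter):
        self.adapter = adapter

    def run(self, ctx):
        try:
            for phase in self.PHASES:
                try:
                    getattr(self.adapter, phase)(ctx)
                except Exception as exc:
                    timed_out = phase == "await_completion" and isinstance(exc, TimeoutError)
                    ctx.failure = "liveness_failure" if timed_out else self.CATEGORY[phase]
                    return ctx
            verdicts = self.adapter.verify(ctx)
            if not all(v["passed"] for v in verdicts):
                ctx.failure = "verification_failure"
            return ctx
        finally:
            # Teardown runs on every path, including early returns.
            self.adapter.teardown(ctx)


class EchoAdapter(BaseWorkloadAdapter):
    """Hypothetical adapter overriding only launch, parse, and verify."""

    def launch(self, ctx):
        ctx.events.append("launch")

    def parse(self, ctx):
        ctx.events.append("parse")

    def verify(self, ctx):
        ctx.events.append("verify")
        return [{"metric": "ttft_ms", "passed": ctx.config["ttft_ms"] <= 50.0}]


INFERENCE_REGISTRY = {"echo": EchoAdapter}


def run_config(config: dict) -> RunContext:
    # The driver never branches on framework name; the registry does the routing.
    adapter = INFERENCE_REGISTRY[config["framework"]]()
    return Job(adapter).run(RunContext(config=config))
```

In this sketch a new framework costs one subclass and one registry entry, and the driver body stays untouched, which is the property the section argues for.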
Full prose: W1, W2.
§2. Tiered tests — what they are and why
A "test" in v1 isn't "did the workload run end-to-end and produce one pass/fail line." It's a layered stack of independent claims about the same workload run. Tier = directory under `cvs/tests/` = abstraction layer of the claim.

| Tier | Example tests |
|---|---|
| 1 (logistics) | `test_image_pullable`, `test_container_up`, `test_role_ready`, `test_no_orphans`, `test_dmesg_clean` |
| 2 (workload kind) | `test_loss_finite`, `test_request_success_rate`, `test_no_5xx_burst` |
| 3 (topology) | `test_per_rank_step_sync`, `test_no_straggler`, `test_router_balance` |
| 4 (framework) | `test_aiter_flags_active`, `test_xla_flags_applied`, `test_attention_backend_matches` |
| 5 (benchmarks; opt-in via `benchmarks:`) | `test_throughput_min`, `test_ttft_p99`, `test_convergence`, `test_goodput` |
| 6 (model) | `test_quant_conversion_consistent` (gpt-oss-120b) |

How tiers are collected

The framework collects tier 1 for every config. Tiers 2–4 are gated by a `collect-skip` hook in `pytest_collection_modifyitems` that compares each test's tier predicate against the cell's config (`workload_kind`, `topology`, `framework`) and deselects mismatched items (deselected rather than skipped, so the report stays clean). Tier 5 is opt-in via the config's `benchmarks: [...]` list and a `@requires_benchmark("name")` decorator. Tier 6 is rare and routes by `model:`.

A typical inference cell collects ~13 test IDs (5 logistics + 2 inference-kind + 2 framework + 3 benchmark + 1 model). A typical training cell collects ~10.
Why tier structure matters
Slicing. A reviewer who wants only "did training loss diverge anywhere last night?" runs `pytest -m "tier_2 and benchmark_loss_finite"` and gets exactly those claims, without needing to know which configs declared `loss_finite` as a check. A user investigating an HF token leak runs `pytest -m "tier_1 and not skipped_insufficient_nodes"` to see only logistics across the whole nightly sweep. Tiers are the structure that makes the test matrix queryable rather than monolithic.

Full prose: W6.
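The tier gating described in this section amounts to partitioning collected items against a cell's config. The sketch below shows that partition as a pure function, under assumptions: `Claim` and `partition_claims` are hypothetical names, and the per-test requirements (for example, that `test_loss_finite` applies to training) are illustrative guesses rather than the real tier predicates.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Claim:
    """Hypothetical stand-in for a collected pytest item."""
    name: str
    tier: int
    requires: Dict[str, str] = field(default_factory=dict)  # tier 2-4 predicate
    benchmark: str = ""                                     # tier 5 opt-in key


def partition_claims(claims: List[Claim], cell: dict) -> Tuple[List[Claim], List[Claim]]:
    """Split claims into (selected, deselected) for one sweep cell."""
    selected, deselected = [], []
    for claim in claims:
        if claim.tier == 5:
            # Tier 5 is opt-in: the config must list the benchmark.
            keep = claim.benchmark in cell.get("benchmarks", [])
        else:
            # Tier 1 has no requirements, so all() over nothing keeps it.
            keep = all(cell.get(k) == v for k, v in claim.requires.items())
        (selected if keep else deselected).append(claim)
    return selected, deselected


CLAIMS = [
    Claim("test_container_up", 1),
    Claim("test_loss_finite", 2, {"workload_kind": "training"}),
    Claim("test_request_success_rate", 2, {"workload_kind": "inference"}),
    Claim("test_router_balance", 3, {"topology": "disagg"}),
    Claim("test_aiter_flags_active", 4, {"framework": "vllm"}),
    Claim("test_ttft_p99", 5, benchmark="ttft_p99"),
    Claim("test_convergence", 5, benchmark="convergence"),
]

CELL = {
    "workload_kind": "inference",
    "topology": "single",
    "framework": "vllm",
    "benchmarks": ["ttft_p99"],
}

selected, deselected = partition_claims(CLAIMS, CELL)
```

Inside a real `pytest_collection_modifyitems` hook, the usual idiom is to assign `items[:] = selected` and report the rest through `config.hook.pytest_deselected(items=deselected)`, which is what keeps deselected items out of the skip count.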
§3. Pytest invocation and lifecycle — the matrix story
What happens when a user runs `cvs run`

The user types:

Internally:

- … `extra = "forbid"`, so typos fail here).
- … `workload_run` session-scoped fixture, and fires the collected test functions against the resulting manifest.
- … `pytest_terminal_summary` hook aggregates verdicts across all cells.

The lifecycle

    flowchart TD
        prep["prepare()"] --> lau["launch()"]
        lau --> aw["await_completion()<br/>(polls progress_predicate)"]
        aw --> par["parse()"]
        par --> ver["verify()<br/>(evaluates thresholds)"]
        ver --> td["teardown()<br/>(RAII; always runs)"]
        prep -.->|"raises"| setup["setup_failure"]
        lau -.->|"raises"| setup
        aw -.->|"predicate broke"| safety["safety_violation"]
        aw -.->|"timeout"| liveness["liveness_failure"]
        par -.->|"pattern hit"| pattern["failure_pattern_matched"]
        ver -.->|"threshold False"| verif["verification_failure"]
        setup --> td
        safety --> td
        liveness --> td
        pattern --> td
        verif --> td

The 6-step body is identical for every workload: there is no `if mode == "training"` branching in the driver. Failures are classified at the boundary where they originate; `teardown` always runs in `finally`. The five failure categories map to actionable next steps. For example, `setup_failure` means your config or environment is wrong; `safety_violation` means the workload broke its own invariants mid-run (NaN loss, server health probe failing, etc.); `verification_failure` means it ran cleanly but missed a threshold.

Markers: the matrix surface
Pytest markers are auto-derived from config fields at collection time. This is the surface a user actually queries:
| Source | Marker | Example |
|---|---|---|
| `framework` | `framework_<name>` | `vllm` → `framework_vllm` |
| `model` | `model_<name>` (underscores normalized) | `gpt-oss-120b` → `model_gpt_oss_120b` |
| `workload_kind` | `workload_<kind>` | `inference` → `workload_inference` |
| `topology` | `topology_<kind>` | `disagg` → `topology_disagg` |
| `target_gpu` | `gpu_<family>` | `mi355x` → `gpu_mi355x` |
| `knobs.<key>` (scalar) | `knob_<key>_<value>` | `attention: aiter` → `knob_attention_aiter` |
| `benchmarks: [...]` (list) | `benchmark_<name>` per entry | `[throughput, ttft_p99]` → `benchmark_throughput`, `benchmark_ttft_p99` |
| test directory under `cvs/tests/` | `tier_N` | `cvs/tests/benchmarks/` → `tier_5` |
| skip reason | `skipped_<reason>` | `insufficient_nodes` → `skipped_insufficient_nodes` |

Registered via `pytest_configure` so `-m` queries don't emit unknown-marker warnings. List-valued config fields fan out into multiple markers.

What a test ID looks like

Every axis a benchmark engineer might want to slice on (framework, model, GPU, attention knob, quant knob, and the sweep cell, here `balanced-conc64`) appears in the parametrize bracket. The test function name (`test_ttft_p99`) names the claim. The marker set on this item includes `framework_vllm`, `model_gpt_oss_120b`, `gpu_mi355x`, `knob_attention_aiter`, `knob_quant_fp4`, `workload_inference`, `topology_single`, `benchmark_ttft_p99`, `tier_5`.

Three real CLI queries
One workload run, many independent claims
The `workload_run` fixture in `cvs/tests/conftest.py` is session-scoped. For each cell, it instantiates the adapter (via the registry), runs `Job.run()` once, and yields the resulting manifest. Every test function for that cell then consumes the same manifest object.

Thirteen test IDs per cell does not mean thirteen workload launches: it means one launch and thirteen independent verdicts. Most tests are pure manifest-reads (`assert manifest.scalars["ttft_p99_ms"] <= threshold.value`), so post-launch verification adds milliseconds per assertion.

Full prose: W1, W6.
§4. Config files — categories of dials
Today's config story is fragmented: cluster JSON declares hostnames, the test wrapper hard-codes role lists, the per-framework library applies defaults via `dict.setdefault`, threshold values live in a stringified-key dict (`"ISL=1024,OSL=1024,TP=8,CONC=64"`), and typos silently fall through to defaults. v1 collapses all of this into one Pydantic-validated YAML per `(framework, model)`. The schema is `extra = "forbid"`, so any unknown field is a `model_validate()` error at parse time.

Anatomy of a config
Each category, one paragraph
Identity: `schema_version`, `test_id`, `target_gpu`. `target_gpu` is asserted against `GpuPlatform.detect()` at config load; running an `mi355x`-targeted config on an `mi300x` cluster is a fail-fast at config load, not a 20-minute crash.

Workload: `framework`, `workload_kind`, `topology`. The framework Literal routes through `INFERENCE_REGISTRY` or `TRAINING_REGISTRY`. `topology.roles` declares what the workload needs (count, GPUs per node, optional label selector); the binder maps roles onto cluster nodes at run time. The same config runs on any cluster that has enough nodes matching the selector; no hostname is ever baked into the config.

Model: single string. Becomes the `model_<name>` marker; routes tier-6 model-specific tests.

Knobs: first-class dict for backend-stack details that benchmark engineers care to slice on. Currently used: `attention` (`aiter`/`fa`/`te`), `quant` (`fp4`/`fp8`/`bf16`), `backend` (engine variant such as `vllm-native` vs `mooncake` vs `sglang-native`), `fused_moe` (kernel variant). Each becomes a `knob_<key>_<value>` marker; slicing the matrix on "all Mooncake configs" is one `-m "knob_backend_mooncake"` query.

Params: framework-specific scalars. Inference: `tensor_parallelism`, `max_model_length`, `num_prompts`. Training: `tp`/`pp`/`dp`/`fsdp` parallelism degrees, `micro_batch_size`, `sequence_length`. Per-framework Pydantic classes (`VllmParams`, `MegatronParams`, …) own validation; Megatron has a validator that asserts `product(parallelism) == total_gpus`, so an invalid combo is caught at parse.

Sweep: declarative axis expansion; full coverage in §5.

Benchmarks: opt-in list naming which tier-5 claim families this config asks for. Configs that don't list `convergence` won't have the `test_convergence` function collected for them, even though the function exists in the test tree.

Thresholds: list of typed predicates that name a metric, an operator, and a target value or window. Six kinds: `Percentile`, `Monotonicity`, `Convergence`, `Stability`, `Rate`, `Goodput`. Direction always comes from the explicit `op:` field and is never inferred from the metric name (today, the substring `"ms"` flips the comparison, so a future `latency_seconds` field would be compared in the wrong direction).

Topology requirements: covered above under Workload. Worth saying again: the cluster file is a pool of nodes only (hostnames, GPUs, labels). All role assignment happens at run time in the binder, per cell.

Secrets: `SecretValue` wrapper. Stringification redacts (`<SecretValue label=hf_token>`); `.reveal()` is only invoked at env-file write time inside the container. The HF token never appears in command-line logs.

Full prose: W3, W5.
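The redaction behaviour described for secrets can be illustrated with a small wrapper. This is a sketch under assumptions, not the shipped class: the constructor signature and the `render_env_file` helper are hypothetical, and only the redacted string form and the `.reveal()` method come from the description above.

```python
class SecretValue:
    """Wraps a secret so that printing or logging it never shows the raw value."""

    __slots__ = ("_label", "_value")

    def __init__(self, label: str, value: str) -> None:
        self._label = label
        self._value = value

    def __repr__(self) -> str:
        return f"<SecretValue label={self._label}>"

    # str(), f-strings, and %s formatting all go through the redacted form.
    __str__ = __repr__

    def reveal(self) -> str:
        """Return the raw secret; intended for the env-file writer only."""
        return self._value


def render_env_file(env: dict) -> str:
    """Hypothetical env-file writer: the one place where secrets are revealed."""
    lines = []
    for key, val in env.items():
        raw = val.reveal() if isinstance(val, SecretValue) else str(val)
        lines.append(f"{key}={raw}")
    return "\n".join(lines) + "\n"


token = SecretValue("hf_token", "hf_example_not_a_real_token")
logged_command = f"docker run -e HF_TOKEN={token} image"
env_file = render_env_file({"HF_TOKEN": token, "TP": 8})
```

Because every implicit string conversion yields the redacted form, a command line assembled with an f-string cannot leak the token by accident; the raw value only appears where `.reveal()` is called explicitly.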
§5. Sweeps — how the matrix expands
A sweep declares one config that expands into N cells. Three semantics, all YAML-driven:

- … `name:` that becomes the parametrize ID.
- … `SweepParams` Pydantic classes enforce invariants (e.g. parallelism product must equal GPU count).

Topology-changing axes (P/D split, node count, parallelism degrees) carry a per-cell `topology` block; the binder re-evaluates node assignments per cell.

Example A: Cartesian (the common case)

→ 6 cells: `[balanced-conc16]`, `[balanced-conc32]`, `[balanced-conc64]`, `[prefill_heavy-conc16]`, `[prefill_heavy-conc32]`, `[prefill_heavy-conc64]`.

Example B: Paired with topology change (sglang P/D split)

→ 4 cells; binder re-evaluates per cell because each `pd_split` carries its own topology block.

Example C: Constraint-validated (Megatron parallelism)

→ 6 cells. The `product(parallelism) == total_gpus` constraint is a Pydantic validator on `MegatronSweepParams`; nonsense combos fail at `model_validate()`, not 20 minutes into the run.

How sweeps propagate into pytest

Each cell becomes a pytest parametrize ID via its `name:` field (or an auto-derived name from scalar values). The full pipeline:

A cell that the cluster can't satisfy gets a manifest with `status: skipped` and a `skipped_<reason>` marker on its pytest items. The cell still appears in `cvs plan` output and remains queryable in `pytest -m`. The matrix degrades gracefully on under-resourced clusters, which is useful for dev boxes.

Dry-running the matrix

`cvs plan --cluster cluster.json --config foo.yaml` is the matrix-preview command. It follows the same code path as `cvs run` up to the point where the `Job` driver would call `adapter.prepare()`, then prints what would happen and exits. Output includes: cells, per-cell role-to-host bindings, selected test functions per cell, skip reasons, estimated wall time. This catches "config doesn't fit cluster" in 2 seconds rather than 30 minutes.

Full prose: W5, W8.
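Two of the sweep behaviours described in this section, Cartesian expansion into named cells and the parallelism-product constraint, can be sketched with the standard library. The axis names (`profile`, `conc`), the naming rule, and both function names are assumptions chosen so that the output matches the six cell IDs listed under Example A; the real schema uses Pydantic validators rather than a plain function.

```python
from itertools import product
from math import prod
from typing import Dict, List


def expand_cartesian(axes: Dict[str, list]) -> List[dict]:
    """Expand axes into one cell per combination, first axis varying slowest."""
    keys = list(axes)
    cells = []
    for combo in product(*(axes[k] for k in keys)):
        values = dict(zip(keys, combo))
        # Assumed naming rule: strings appear as-is, scalars are prefixed by their axis key.
        name = "-".join(v if isinstance(v, str) else f"{k}{v}" for k, v in values.items())
        cells.append({"name": name, **values})
    return cells


def validate_parallelism(cell: dict, total_gpus: int) -> dict:
    """Reject cells whose parallelism degrees do not multiply to the GPU count."""
    degrees = [cell[k] for k in ("tp", "pp", "dp")]
    if prod(degrees) != total_gpus:
        raise ValueError(
            f"tp*pp*dp = {prod(degrees)} does not equal total_gpus = {total_gpus}"
        )
    return cell


cells = expand_cartesian({
    "profile": ["balanced", "prefill_heavy"],
    "conc": [16, 32, 64],
})
cell_names = [c["name"] for c in cells]
```

Validating at expansion time means an impossible combination such as `tp=4, pp=4, dp=1` on 8 GPUs is rejected before any container starts.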
§6. Metrics and benchmarks supported per framework
The "framework emits → CVS retains" principle (§1) means each framework's native telemetry routes directly into the manifest's Parquet sidecars. The contract is uniform across adapters:
`samples.parquet` is request-grained or sample-grained (inference); `trajectory.parquet` is time-grained (training, or inference time-series). Benchmarks (tier-5 claims) opt in per config.

Captured metrics per framework

Table columns: `samples.parquet` columns, `trajectory.parquet` series. Cells in reading order:

- `request_id`, `ttft_ms`, `tpot_ms`, `itl_ms`, `e2el_ms`, `output_tokens`
- `queue_depth`, `memory_pressure`, `gpu_util`
- `kv_transfer_ms` (P→D handoff), `prefill_done_ns`, `decode_start_ns`
- `router_queue`, `decode_kv_cache_util`, `per_role_startup_ms`
- `latency_ms`, `tokens`, `throughput_tps`
- `generation_ms`, `prompt_id`, `seed`, `num_inference_steps`
- `latency_ms`
- `latency_ms`, `frame_idx`, `bitrate_kbps`
- `fps`, `aggregate_bitrate`
- `loss`, `throughput`, `step_time_ms`, `grad_norm`, `mem_used_gb` (per-rank)
- `loss`, `throughput`, `step_time_ms`, per-host metrics from coordinator

Long-format Parquet means adding a new metric is a new row, not a new column, so there is no schema migration on existing manifests. A new field that's already in the framework's emission costs one line in `parse()` and zero elsewhere.

Benchmarks (tier-5 claims) supported per framework

| Benchmark | Threshold kind |
|---|---|
| `throughput` | `Rate` |
| `ttft_p99` | `Percentile` |
| `tpot_p99` | `Percentile` |
| `itl_p99` | `Percentile` |
| `goodput` | `Goodput` |
| `handoff_latency` | `Percentile` |
| `router_balance` | |
| `image_throughput` | `Rate` |
| `step_time_p99` | `Percentile` |
| `video_throughput` | `Rate` |
| `convergence` | `Convergence` |
| `loss_finite` | |
| `monotonic_loss` | `Monotonicity` |
| `step_time_stability` | `Stability` |
| `no_straggler` | |

Threshold predicates
All six kinds, as they appear in config:
Each evaluates against the manifest's `samples` or `trajectory` carriers and emits a `Verdict` row with `expected`, `actual`, `passed`, `margin`. The `margin` field powers regression alerts: "P99 TTFT margin shrank from +12 ms to +2 ms over the last 10 runs" is one DuckDB query away (§7).

Adding new coverage
The three common add-flows are intentionally cheap:
- … `parse()`. No schema migration. Existing manifests don't know about the new metric; new manifests do.
- … `benchmarks: [...]` in the YAML. The matching tier-5 test function collects automatically.

Full prose: W2, W3.
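How a threshold turns into a `Verdict` row with a signed margin can be sketched as follows. This is an illustration under assumptions: the `Verdict` fields mirror the verdict entries in the sample manifest, but `evaluate`, `percentile`, the nearest-rank method, and the margin sign convention (positive means headroom) are choices made for this sketch. The comparison direction comes only from the explicit `op`.

```python
import math
import operator
from dataclasses import dataclass
from typing import Iterable

_OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}


@dataclass(frozen=True)
class Verdict:
    metric: str
    op: str
    expected: float
    actual: float
    passed: bool
    margin: float


def percentile(values: Iterable[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(metric: str, op: str, expected: float, actual: float) -> Verdict:
    """Compare actual against expected using the explicit operator."""
    passed = _OPS[op](actual, expected)
    # Positive margin = headroom; negative margin = how far the threshold was missed.
    margin = expected - actual if op in ("<=", "<") else actual - expected
    return Verdict(metric, op, expected, actual, passed, round(margin, 6))


ttft = evaluate("ttft_ms", "<=", 50.0, 73.4)
tpot = evaluate("tpot_ms", "<=", 25.0, 14.2)
tput = evaluate("throughput", ">=", 1200.0, 1318.0)
p99 = percentile(range(1, 101), 99)
```

With this convention the three example verdicts reproduce the margins shown in the sample manifest, and a shrinking positive margin across runs is an early regression signal even while the verdict still passes.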
§7. Manifest and sidecars — durable runs and regression analysis
The manifest is the contract between "the workload ran" and every downstream consumer (tests, dashboards, regression analysis). It exists on disk at a content-addressable path, survives pytest's process, and is the single artifact every test function reads. Today, a CVS run prints results to stdout and forgets them; v1 makes the run a queryable record.
Per-run directory layout
Content-addressable directory key = `<short_hash>` of (workload-defining inputs + framework image digest + bindings). Same config + same cluster always lands at the same path.

Sample `manifest.json` (abbreviated, real values)

    {
      "schema_version": "1.0",
      "run_id": "0193a8e2-71c1-7e0f-9c1a-7d5e8e1f4a02",
      "test_id": "vllm_gpt_oss_120b_mi355x_aiter",
      "cell_id": "balanced-conc64",
      "config_hash": "sha256:91a2...e44b",
      "workload_hash": "sha256:7d3a...b21f",
      "verification_hash": "sha256:9c1e...8f12",
      "experiment_id": "vllm/gpt-oss-120b/fp4+aiter+vllm-native/mi355x",
      "cvs_git_sha": "a4f1e2c",
      "framework_image_digest": "sha256:7d3a...b21f",
      "framework_versions": {"vllm": "0.10.2", "torch": "2.7.1", "rocm": "6.4.0"},
      "timestamp_start": "2026-05-28T20:01:08Z",
      "timestamp_end": "2026-05-28T20:14:52Z",
      "hosts": [{"hostname": "n1", "ip": "10.0.0.11", "role": "server"}],
      "model_descriptor": {"hf_repo": "openai/gpt-oss-120b", "precision": "fp4"},
      "phases": {
        "prepare": {"duration_s": 4.3, "status": "ok"},
        "launch": {"duration_s": 41.7, "status": "ok"},
        "await": {"duration_s": 720.0, "status": "ok"},
        "parse": {"duration_s": 1.8, "status": "ok"},
        "verify": {"duration_s": 0.1, "status": "failed"},
        "teardown": {"duration_s": 6.9, "status": "ok"}
      },
      "status": "failed_verification",
      "failure": {
        "category": "verification_failure",
        "originated_in_phase": "verify",
        "message": "P99 TTFT 73.4ms exceeds threshold 50.0ms"
      },
      "verdicts": [
        {"kind": "Percentile", "metric": "ttft_ms", "op": "<=", "expected": 50.0, "actual": 73.4, "passed": false, "margin": -23.4},
        {"kind": "Percentile", "metric": "tpot_ms", "op": "<=", "expected": 25.0, "actual": 14.2, "passed": true, "margin": 10.8},
        {"kind": "Rate", "metric": "throughput", "op": ">=", "expected": 1200.0, "actual": 1318.0, "passed": true, "margin": 118.0}
      ],
      "result": {"scalars": {"ttft_p99_ms": 73.4, "tpot_p99_ms": 14.2, "throughput_tps": 1318.0}},
      "samples_path": "samples.parquet",
      "trajectory_path": "trajectory.parquet",
      "events_path": "events.jsonl"
    }

Sidecar schemas

`samples.parquet` (long-format, one row per request/sample), columns: `request_id`, `ts`, `ttft_ms`, `tpot_ms`, `itl_ms`, `e2el_ms`, `output_tokens`, `role`, `host`.

`trajectory.parquet` (long-format, one row per (step, metric)):

| Column | Example |
|---|---|
| `step` | `100` |
| `ts` | `2026-05-28T20:05:00Z` |
| `metric` | `"loss"`, `"throughput_tps"`, `"router_queue"` |
| `value` | `4.21` |
| `role` | `"worker"`, `"router"` |
| `host` | `"n1"` |

`events.jsonl` (closed vocabulary, one line per event):

| Event | Fields (besides `ts`) |
|---|---|
| `prepare.start` / `prepare.done` | `phase_duration_s` |
| `launch.container_up` / `launch.role_ready` | `role`, `host` |
| `step` | `step`, `loss`, `throughput` |
| `request` | `request_id`, `ttft_ms`, … |
| `safety.violated` | `predicate`, `detail` |
| `pattern.matched` | `pattern_id`, `source`, `line` |
| `parse.done` | `samples_rows`, `trajectory_rows` |
| `verify.failed` | `metric`, `actual`, `expected_max` |
| `teardown.done` | |

Adding an event is a schema change reviewed in PR, not a free-for-all `log.info`.

Cross-run regression analysis (the data-science seam)
Three lines of pandas catch a regression. The same Parquet file feeds a Grafana panel, a Streamlit dashboard, or a nightly alert script. No CVS service is required.
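The shape of such a regression query is shown below as a dependency-free stand-in, so that it runs without pandas installed. Everything here is an assumption made for illustration: the flattened row layout (`cvs_git_sha`, `metric`, `value`), the function names, the 10% tolerance, and the second git SHA, which is invented; only `a4f1e2c` and the 73.4 ms figure appear in the sample manifest.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def worst_p99_ttft_by_sha(rows: Iterable[dict]) -> Dict[str, float]:
    """Group exported rows by git SHA and keep the worst P99 TTFT seen for each."""
    by_sha = defaultdict(list)
    for row in rows:
        if row["metric"] == "ttft_p99_ms":
            by_sha[row["cvs_git_sha"]].append(row["value"])
    return {sha: max(values) for sha, values in by_sha.items()}


def regressions(baseline_sha: str, by_sha: Dict[str, float], tolerance: float = 0.10) -> List[str]:
    """Return SHAs whose worst P99 TTFT exceeds the baseline by more than the tolerance."""
    limit = by_sha[baseline_sha] * (1 + tolerance)
    return sorted(sha for sha, value in by_sha.items() if value > limit)


# Hypothetical export rows; a real run would read them from the flattened Parquet file.
ROWS = [
    {"cvs_git_sha": "a4f1e2c", "metric": "ttft_p99_ms", "value": 48.0},
    {"cvs_git_sha": "a4f1e2c", "metric": "ttft_p99_ms", "value": 47.5},
    {"cvs_git_sha": "a4f1e2c", "metric": "tpot_p99_ms", "value": 14.2},
    {"cvs_git_sha": "b7c9d10", "metric": "ttft_p99_ms", "value": 73.4},
]

summary = worst_p99_ttft_by_sha(ROWS)
flagged = regressions("a4f1e2c", summary)
```

With pandas, the grouping step would typically be a boolean filter on `metric` followed by `groupby("cvs_git_sha")["value"].max()`; the logic is otherwise the same.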
Why the manifest design makes future capabilities cheap
Three things fall out of this schema for free:
Re-verify without re-running. The manifest splits `workload_hash` (workload-defining inputs: framework, image digest, model, dataset, knobs, params, bindings, seed) from `verification_hash` (thresholds + pattern catalog). If only thresholds change, a future `--reuse-manifests` flag can re-evaluate verdicts against cached `samples.parquet` and rewrite the verdicts block, with no workload launch. The hashes are recorded from day one even though the flag isn't shipped in v1, so the feature lands later as ~100 LOC with no historical-manifest migration.

Re-parse logs into new metrics. Raw logs persist in `logs/`. If a new metric becomes valuable post-hoc (e.g. a tail-latency percentile we forgot to capture), a `--reparse` flag can re-run the current parser against the saved logs and rewrite `samples.parquet`. Same content-addressable path; same manifest gets a new parse-time verdict block.

Dashboards consume Parquet directly. Long-format columnar layout means a new metric becomes new rows, not a schema migration. Cross-run aggregations are DuckDB one-liners. No bespoke ingestion pipeline needs to be built before the data is queryable.
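The split between the two hashes can be sketched with canonical JSON and SHA-256. This is an illustration under assumptions: `stable_hash`, `manifest_hashes`, the `patterns` key, and the canonicalization scheme are hypothetical; only the division of inputs between `workload_hash` and `verification_hash` follows the description above.

```python
import hashlib
import json


def stable_hash(payload: dict) -> str:
    """Hash a dict independently of key order by serializing it canonically."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def manifest_hashes(config: dict, image_digest: str, bindings: dict) -> dict:
    """Compute the two hashes from disjoint parts of the run definition."""
    workload = {
        key: config.get(key)
        for key in ("framework", "model", "dataset", "knobs", "params", "seed")
    }
    workload["image_digest"] = image_digest
    workload["bindings"] = bindings
    verification = {
        "thresholds": config.get("thresholds", []),
        "patterns": config.get("patterns", []),
    }
    return {
        "workload_hash": stable_hash(workload),
        "verification_hash": stable_hash(verification),
    }


BASE = {
    "framework": "vllm",
    "model": "gpt-oss-120b",
    "knobs": {"attention": "aiter", "quant": "fp4"},
    "params": {"tensor_parallelism": 8},
    "seed": 0,
    "thresholds": [{"kind": "Percentile", "metric": "ttft_ms", "op": "<=", "value": 50.0}],
}
# Same workload, looser threshold: only the verification side should change.
LOOSER = {**BASE, "thresholds": [{"kind": "Percentile", "metric": "ttft_ms", "op": "<=", "value": 80.0}]}

BINDINGS = {"server": ["n1"]}
before = manifest_hashes(BASE, "sha256:example-digest", BINDINGS)
after = manifest_hashes(LOOSER, "sha256:example-digest", BINDINGS)
```

When `workload_hash` is unchanged and `verification_hash` differs, the cached samples are still valid and only the verdicts need recomputing, which is the condition a re-verify flag would check.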
Full prose: W4, W8.