Preserve MLA attention migration across metrics boundaries by fwyc0573 · Pull Request #3 · NetX-lab/Frontier

fwyc0573 · 2026-06-29T13:08:23Z

PR Description Draft: Preserve MLA attention migration across metrics boundaries

Title

Preserve MLA attention migration across metrics boundaries

Summary

This PR migrates the reviewed feature-attn-sim attention refactor onto a branch with clean origin/main ancestry, without using --allow-unrelated-histories. It keeps the first PR focused on Frontier attention measurement/simulation, op-level tracing, predictor/runtime metadata, profiling importers, and reproducible equivalence validation.

The migration intentionally excludes the vLLM bridge and the current task-memory documentation from the first PR. Those remain local-only / follow-up scope unless explicitly approved later.

Safe Migration Route

PR branch: pr/feature-attn-sim-migration
Base branch: origin/main
Local commit: fe70f8f47e054a0dcaec28bb77a9f433380eb3cb
Commit subject: Preserve MLA attention migration across metrics boundaries
Commit scope: 81 paths, 24645 insertions, 267 deletions
Ancestry gate: git merge-base --is-ancestor origin/main HEAD (exit 0)
Whitespace gate: git diff --check origin/main..HEAD (exit 0)
Forbidden-path scan: 0 matches
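
A forbidden-path scan of this kind can be pictured as a pattern check over the paths reported by git diff --name-only. The sketch below is illustrative only: the find_forbidden_paths helper and the glob patterns are hypothetical and are not taken from the PR; only the excluded categories (vLLM bridge, task-memory documentation, the local =10.1 artifact) come from the description.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: the PR names the excluded categories,
# not the exact paths or globs used by the real scan.
FORBIDDEN_PATTERNS = ["*vllm_bridge*", "*task-memory*", "=10.1"]


def find_forbidden_paths(changed_paths, patterns=FORBIDDEN_PATTERNS):
    """Return every changed path that matches at least one forbidden glob."""
    return [
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


# Two paths that the description lists as part of the PR.
changed = [
    "frontier/profiling/attention/vllm_mla_profile_importer.py",
    "tests/e2e/attention_equivalence/profile_manifest.py",
]
matches = find_forbidden_paths(changed)
print(f"Forbidden-path scan: {len(matches)} matches")
```

Returning the matching paths rather than a bare count keeps a failing gate actionable, since the offending files are named directly.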

What Changed

Attention families and metadata

Adds explicit attention-family abstractions for MHA, MQA, and latent MLA attention behavior.
Centralizes attention op naming, trace mapping, memory layout, profiling mapping, and model binding boundaries.
Extends quantization/op metadata handling so MLA trace ops are first-class instead of being accidentally treated as dense attention fallbacks.
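
As a rough illustration of what "first-class" means here, the sketch below resolves each op through a family registry and raises when no family claims it, instead of silently falling back to dense. The AttentionFamily class, the registry, and every op name are hypothetical placeholders; only the attn_mla_ prefix comes from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionFamily:
    """Hypothetical stand-in for an attention-family descriptor."""
    name: str
    ops: tuple  # op names this family may emit


# Placeholder op names, not the real Frontier operator list.
REGISTRY = {
    "dense": AttentionFamily("dense", ("attn_dense_example_a", "attn_dense_example_b")),
    "mla": AttentionFamily("mla", ("attn_mla_example_a", "attn_mla_example_b")),
}


def enabled_attention_ops(registry):
    """Derive the enabled op list from registered families, not a hard-coded list."""
    ops = []
    for family in registry.values():
        ops.extend(family.ops)
    return ops


def family_for_op(op_name, registry):
    """Resolve the owning family; raise rather than fall back to dense."""
    for family in registry.values():
        if op_name in family.ops:
            return family.name
    raise KeyError(f"no attention family registers op {op_name!r}")


print(enabled_attention_ops(REGISTRY))
print(family_for_op("attn_mla_example_a", REGISTRY))
```

Failing loudly on an unregistered op is the design choice that matters: a dense fallback would produce plausible-looking numbers while hiding the missing metadata.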

Profiling and runtime evidence paths

Adds MLA / MHA / MQA analysis and live-probe helpers under tests/analysis/.
Adds frontier/profiling/attention/vllm_mla_profile_importer.py for MLA profiling import support inside Frontier scope.
Adds online per-op comparison utilities for chunked-prefill online traces.

Equivalence validation infrastructure

Adds deterministic artifact comparison tools under tests/e2e/attention_equivalence/:
- profile_manifest.py
- measurement_csv_equivalence.py
- simulation_output_equivalence.py
- run_equivalence_matrix.py
- matrix_cases.yaml
Adds unit coverage for attention families, tracing, profiling contracts, predictor semantics, equivalence harness behavior, and MLA runtime operator timing.
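
To make the idea of a strict artifact comparison concrete, here is a minimal sketch of a cell-by-cell CSV check. It is not the implementation in measurement_csv_equivalence.py or simulation_output_equivalence.py; the compare_csv_strict function and its mismatch tuple format are assumptions made for illustration. The sample values are the offline DP=2/EP=2 figures quoted in this description.

```python
import csv
import io


def _parse(cell):
    """Return a float when the cell is numeric, otherwise the raw value."""
    try:
        return float(cell)
    except (ValueError, TypeError):
        return cell


def compare_csv_strict(ref_text, cand_text):
    """Compare two CSV texts row by row and cell by cell with no tolerance.

    Note: NaN cells never compare equal, so they are reported as mismatches.
    """
    ref_rows = list(csv.DictReader(io.StringIO(ref_text)))
    cand_rows = list(csv.DictReader(io.StringIO(cand_text)))
    if len(ref_rows) != len(cand_rows):
        return [("row_count", len(ref_rows), len(cand_rows))]
    mismatches = []
    for index, (ref, cand) in enumerate(zip(ref_rows, cand_rows)):
        if ref.keys() != cand.keys():
            mismatches.append((index, "columns", list(ref), list(cand)))
            continue
        for column in ref:
            if _parse(ref[column]) != _parse(cand[column]):
                mismatches.append((index, column, ref[column], cand[column]))
    return mismatches


reference = "request_e2e_time,ttft\n306.982576876489,133.878460025234\n"
candidate = "request_e2e_time,ttft\n306.982576876489,133.878460025234\n"
print(len(compare_csv_strict(reference, candidate)))
```

Exact comparison is deliberate in this sketch: a deterministic simulator should reproduce the same values, so any tolerance would only mask drift.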

Root-Cause Fix Highlight: MLA Trace Metadata and Ledger Completeness

A final blocker was found and fixed before this commit:

Root cause: MLA attention introduced attn_mla_* operators that can be emitted by MetricsStore, but the quantization manager, op trace metadata, trace mapping, and stage-batch ledger were not all using the same attention-family boundary.
Fix: derive enabled attention ops from registered attention families, explicitly support MLA op metadata, enforce family-scoped structured timing completeness, and use the latent MLA attention family for ledger component mapping.
Structured ledger numeric check: component sum 0.42 ms, total 0.42 ms, delta 0.0 ms.
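
The ledger check can be read as two steps: require that every op of the bound family has a structured timing, then compare the component sum with the reported total. The sketch below assumes that reading; check_ledger, the op names, the per-component values, and the absolute tolerance are all hypothetical, and only the 0.42 ms total comes from this PR.

```python
import math

# Placeholder op names; the real six attn_mla_* operators are not listed here.
MLA_FAMILY_OPS = ("attn_mla_example_a", "attn_mla_example_b", "attn_mla_example_c")


def check_ledger(component_ms, total_ms, required_ops, abs_tol=1e-9):
    """Enforce family completeness, then return component_sum - total_ms."""
    missing = [op for op in required_ops if op not in component_ms]
    if missing:
        raise ValueError(f"incomplete structured timings, missing: {missing}")
    component_sum = math.fsum(component_ms[op] for op in required_ops)
    delta = component_sum - total_ms
    # The tolerance only absorbs floating-point rounding (an assumption of this sketch).
    if not math.isclose(component_sum, total_ms, rel_tol=0.0, abs_tol=abs_tol):
        raise ValueError(f"ledger mismatch: delta={delta} ms")
    return delta


timings = {
    "attn_mla_example_a": 0.20,
    "attn_mla_example_b": 0.15,
    "attn_mla_example_c": 0.07,
}
delta = check_ledger(timings, 0.42, MLA_FAMILY_OPS)
print(abs(delta) < 1e-9)
```

Checking completeness before summing matters because a missing component and a mismatched total are different failures, and only the first points at a family-boundary bug.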

Validation Evidence

Targeted TDD evidence for MLA metadata / ledger fix

| Gate | Result |
| --- | --- |
| RED before fix | `4 failed, 27 passed in 1.79s` |
| Targeted GREEN | `31 passed in 1.58s` |
| Expanded targeted GREEN | `65 passed in 3.29s` |
| Full migrated attention regression | `377 passed, 3 skipped, 2 warnings in 14.09s` |

Extended strict equivalence matrix

The first six strict simulation-equivalence cases passed with zero mismatches:

dense_coloc_offline_short
dense_coloc_online_highqps
dense_pdd_offline_long
dense_pdd_online_highqps
moe_coloc_offline_short
moe_coloc_online_highqps

Result summary:

| Matrix Slice | Status | Case Count | Pass Count | Mismatches |
| --- | --- | --- | --- | --- |
| First-six strict matrix | `PASS` | `6` | `6` | `0` for every case |
| Original all-eight EP=1/DP=1 matrix | `REFERENCE_BASELINE_FAILS` | `8` | `6` | Source MoE PDD baseline rows fail before strict candidate comparison |

Important caveat: this PR does not claim full 8-case EP=1/DP=1 equivalence PASS. The original EP=1/DP=1 source MoE PDD rows remain REFERENCE_BASELINE_FAILS for moe_pdd_offline_long and moe_pdd_online_highqps.

MoE PDD source-compatible strict equivalence closure

A source-compatible DP=2/EP=2 MoE PDD configuration passed strict source-vs-migration comparison:

| Case | Rows | Columns | Numeric Comparisons | request_e2e_time ref/cand | ttft ref/cand | tpot ref/cand |
| --- | --- | --- | --- | --- | --- | --- |
| offline DP=2/EP=2 | `2/2` | `82/82` | `153` | `306.982576876489 / 306.982576876489` | `133.878460025234 / 133.878460025234` | `86.089884095705 / 86.089884095705` |
| online DP=2/EP=2 | `4/4` | `82/82` | `307` | `1105.036249009642 / 1105.036249009642` | `380.290074694862 / 380.290074694862` | `172.905575260109 / 172.905575260109` |

Result summary:

status=PASS
case_count=2
pass_count=2
mismatch_count=0

Live worker and deterministic artifact evidence

A bounded worker smoke test passed on an NVIDIA H800 and produced both attention.csv and attention_kernel_only.csv under task-local artifacts.
Real deterministic artifact equivalence passed for four explicit cases: H800 attention.csv, H800 attention_kernel_only.csv, dense offline simulator output, and MoE offline simulator output.

Independent Review Status

Code-reviewer re-review: COMMENT; original blocker CLOSED; no CRITICAL/HIGH/MEDIUM issues.
Architecture re-review: WATCH; original architecture blocker CLOSED.

WATCH caveat to preserve: tracing and ledger paths are strict on family-complete structured timings. AttentionTime.total_time() remains a legacy compatibility path and is not itself a family-completeness checker.

Risks and Follow-ups

Remote CI has not run yet and will not run until this PR is opened.
The original EP=1/DP=1 source MoE PDD rows remain REFERENCE_BASELINE_FAILS; do not describe them as an 8-case strict PASS.
vLLM bridge work is intentionally excluded from this PR and should be handled as follow-up scope.
Current task-memory documentation remains local-only and is not part of the PR diff.
The local untracked artifact =10.1 remains intentionally uncommitted and undeleted.

Commands Used for Final Local Gates

cd /data/ycfeng/Frontier/worktrees/feature-attn-sim-migration
git status --short --branch
git diff --check origin/main..HEAD
git merge-base --is-ancestor origin/main HEAD
git diff --name-only origin/main..HEAD

Latest confirmed local results:

Branch state: pr/feature-attn-sim-migration...origin/main [ahead 1]
Remaining untracked local artifact: =10.1
git diff --check origin/main..HEAD: exit 0
git merge-base --is-ancestor origin/main HEAD: exit 0
Commit path count: 81
Forbidden path count: 0

Migrate attention-family modeling, MLA profiling/prediction support, equivalence harnesses, and regression coverage onto the normal origin/main ancestry while excluding the vLLM bridge and local task-memory artifacts.

Constraint: First Frontier PR excludes vLLM bridge, task-memory artifacts, runtime state, and local =10.1 artifact.
Rejected: Direct --allow-unrelated-histories merge | it would mix unrelated roots and create an unreviewable broad replacement diff.
Rejected: Fallback dense-op mapping for attn_mla_* | it would hide missing metadata and ledger contracts.
Confidence: high
Scope-risk: moderate
Directive: Keep attention-family consumers on shared family mappers when structured operator timings exist; keep vLLM bridge work in a separate follow-up PR.
Tested: Expanded targeted MLA/family suite 65 passed in 3.29s; full migrated attention regression 377 passed, 3 skipped, 2 warnings in 14.09s; staged diff check exit 0; staged forbidden paths 0.
Not-tested: Remote CI, branch push, and PR creation are not run pending explicit user approval.

The analytical KV-cache transfer predictor and the op-trace transfer metadata both sized the cross-cluster KV payload with the dense formula num_kv_heads x head_dim x 2. For latent MLA this silently injected the dense head layout (~49152 elems/token) instead of the latent layout (kv_lora_rank + qk_rope_head_dim = 576, kv_factor = 1), polluting transfer latency and TTFT under pd-disaggregation with no error raised.

Resolve the runtime KV layout through bind_attention_family + get_attention_runtime_kv_layout at both sites and replace the literal x2 with layout.kv_factor. Dense stays byte-identical (kv_factor=2, runtime heads = num_kv_heads, runtime head size = head_dim); MLA now uses (1, 576, 1). Existing override_num_heads / override_head_dim precedence is preserved (family still owns kv_factor).

Tests: dense assertions unchanged; new latent-MLA cases lock transfer size 18432 (batch) / 6912 (request) and trace meta [4,2,1,576,1].
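
The size gap between the two layouts is easy to check by hand. The sketch below works through the per-token element counts; KVLayout is a hypothetical container (not the return type of get_attention_runtime_kv_layout), and the dense shape of 128 heads with head_dim 192 and the 512 + 64 split of 576 are assumptions chosen because they reproduce the ~49152 and 576 figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVLayout:
    """Hypothetical runtime KV layout; field names are illustrative."""
    num_heads: int
    head_size: int
    kv_factor: int  # 2 for separate K and V, 1 for a single latent vector

    def elems_per_token(self):
        """Elements stored per token under this layout."""
        return self.num_heads * self.head_size * self.kv_factor


# Assumed dense-style shape that matches the ~49152 elems/token figure.
dense = KVLayout(num_heads=128, head_size=192, kv_factor=2)

# Latent MLA: one runtime "head" of width kv_lora_rank + qk_rope_head_dim.
kv_lora_rank, qk_rope_head_dim = 512, 64  # assumed split of the quoted 576
mla = KVLayout(num_heads=1, head_size=kv_lora_rank + qk_rope_head_dim, kv_factor=1)

print(dense.elems_per_token())  # 49152
print(mla.elems_per_token())    # 576
```

Under these assumed shapes the dense formula overstates the latent payload by a factor of roughly 85, which is why the error distorted transfer latency and TTFT without raising anything.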

The shared ExecutionTimePredictionModelManager (always built at simulator.py:153 and used by the pd-disaggregation path) hard-coded the dense attention family for attention-core training. Feeding it a latent-MLA profile failed fast at shared_prediction_model_manager.py with "Missing columns for attn_kv_cache_save training: ['total_tokens', 'kv_cache_size']", so the six attn_mla_* models were never produced for the disagg path while the monolithic predictor trained them fine.

Make attention-core training family-aware, mirroring the monolithic reference (sklearn_execution_time_predictor.py): route MLA profiles through a structural filter in _load_attention_df, normalize MLA derived features, and train exactly the six attn_mla_* operators in _train_attn_models_for_cluster before the dense block, attaching the shared _build_exact_feature_lookup for on-demand prediction. This is training-only: the on-demand consumer already dispatches family-aware and works unchanged once the six estimators exist. Dense path is untouched.

Tests: new eager-attention MLA suite locks the six trained ops, their feature/target columns, the sparse-by-target row counts (kv_cache_save=3, others=2), the exact-lookup attachment, and the structural-filter row selection.
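
One plausible reading of "sparse-by-target" is that each operator is trained only on the profile rows where its own target column was measured, so row counts differ per target. The sketch below illustrates that reading only; rows_for_target, the column names, and the values are hypothetical and do not describe the actual behavior of _load_attention_df or _train_attn_models_for_cluster.

```python
def rows_for_target(rows, target):
    """Keep only the rows where the target column holds a measured value."""
    return [row for row in rows if row.get(target) is not None]


# Placeholder column names and timings, shaped to give a 3 vs 2 split.
profile_rows = [
    {"total_tokens": 16, "attn_mla_example_save": 0.010, "attn_mla_example_core": 0.100},
    {"total_tokens": 32, "attn_mla_example_save": 0.020, "attn_mla_example_core": None},
    {"total_tokens": 64, "attn_mla_example_save": 0.040, "attn_mla_example_core": 0.400},
]

targets = ("attn_mla_example_save", "attn_mla_example_core")
row_counts = {target: len(rows_for_target(profile_rows, target)) for target in targets}
print(row_counts)
```

Filtering per target avoids dropping a whole row just because one operator has no measurement for it, at the cost of each model seeing a slightly different training set.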

fwyc0573 added 3 commits June 29, 2026 15:48


fwyc0573 wants to merge 3 commits into main from pr/feature-attn-sim-migration
