Preserve MLA attention migration across metrics boundaries #3
Open
fwyc0573 wants to merge 3 commits into
Migrate attention-family modeling, MLA profiling/prediction support, equivalence harnesses, and regression coverage onto the normal origin/main ancestry while excluding the vLLM bridge and local task-memory artifacts.
Constraint: First Frontier PR excludes vLLM bridge, task-memory artifacts, runtime state, and local =10.1 artifact.
Rejected: Direct --allow-unrelated-histories merge | it would mix unrelated roots and create an unreviewable broad replacement diff.
Rejected: Fallback dense-op mapping for attn_mla_* | it would hide missing metadata and ledger contracts.
Confidence: high
Scope-risk: moderate
Directive: Keep attention-family consumers on shared family mappers when structured operator timings exist; keep vLLM bridge work in a separate follow-up PR.
Tested: Expanded targeted MLA/family suite 65 passed in 3.29s; full migrated attention regression 377 passed, 3 skipped, 2 warnings in 14.09s; staged diff check exit 0; staged forbidden paths 0.
Not-tested: Remote CI, branch push, and PR creation are not run pending explicit user approval.
The analytical KV-cache transfer predictor and the op-trace transfer metadata both sized the cross-cluster KV payload with the dense formula num_kv_heads x head_dim x 2. For latent MLA this silently injected the dense head layout (~49152 elems/token) instead of the latent layout (kv_lora_rank + qk_rope_head_dim = 576, kv_factor = 1), polluting transfer latency and TTFT under pd-disaggregation with no error raised.
Resolve the runtime KV layout through bind_attention_family + get_attention_runtime_kv_layout at both sites and replace the literal x2 with layout.kv_factor. Dense stays byte-identical (kv_factor=2, runtime heads = num_kv_heads, runtime head size = head_dim); MLA now uses (1, 576, 1). Existing override_num_heads / override_head_dim precedence is preserved (family still owns kv_factor).
Tests: dense assertions unchanged; new latent-MLA cases lock transfer size 18432 (batch) / 6912 (request) and trace meta [4,2,1,576,1].
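The sizing difference described above can be illustrated with a minimal sketch. `KVLayout`, `dense_layout`, and `latent_mla_layout` are hypothetical names for illustration only, not the helpers in this PR, and the example dimensions (128 KV heads, head_dim 192, kv_lora_rank 512, qk_rope_head_dim 64) are assumptions chosen because they reproduce the per-token element counts quoted in the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVLayout:
    """Hypothetical stand-in for a runtime KV layout (not the PR's class)."""

    kv_factor: int  # 2 for dense (separate K and V), 1 for latent MLA
    num_heads: int  # runtime KV heads
    head_size: int  # runtime head size in elements

    def elems_per_token(self) -> int:
        # Use the family's kv_factor instead of a hard-coded "x 2".
        return self.kv_factor * self.num_heads * self.head_size


def dense_layout(num_kv_heads: int, head_dim: int) -> KVLayout:
    # Dense attention stores K and V per head.
    return KVLayout(kv_factor=2, num_heads=num_kv_heads, head_size=head_dim)


def latent_mla_layout(kv_lora_rank: int, qk_rope_head_dim: int) -> KVLayout:
    # Latent MLA stores one compressed vector per token.
    return KVLayout(
        kv_factor=1,
        num_heads=1,
        head_size=kv_lora_rank + qk_rope_head_dim,
    )


# Assumed example dimensions (not taken from the PR's configs).
dense = dense_layout(num_kv_heads=128, head_dim=192)
mla = latent_mla_layout(kv_lora_rank=512, qk_rope_head_dim=64)

print(dense.elems_per_token())  # 49152
print(mla.elems_per_token())    # 576
```

Under these assumed dimensions, sizing an MLA transfer with the dense formula would overstate the per-token payload by roughly 85x, which is the kind of silent inflation the commit removes.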
The shared ExecutionTimePredictionModelManager (always built at simulator.py:153 and used by the pd-disaggregation path) hard-coded the dense attention family for attention-core training. Feeding it a latent-MLA profile failed fast at shared_prediction_model_manager.py with "Missing columns for attn_kv_cache_save training: ['total_tokens', 'kv_cache_size']", so the six attn_mla_* models were never produced for the disagg path while the monolithic predictor trained them fine.
Make attention-core training family-aware, mirroring the monolithic reference (sklearn_execution_time_predictor.py): route MLA profiles through a structural filter in _load_attention_df, normalize MLA derived features, and train exactly the six attn_mla_* operators in _train_attn_models_for_cluster before the dense block, attaching the shared _build_exact_feature_lookup for on-demand prediction. This is training-only: the on-demand consumer already dispatches family-aware and works unchanged once the six estimators exist. Dense path is untouched.
Tests: new eager-attention MLA suite locks the six trained ops, their feature/target columns, the sparse-by-target row counts (kv_cache_save=3, others=2), the exact-lookup attachment, and the structural-filter row selection.
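The fail-fast behaviour quoted above can be sketched in a few lines. This is an illustration only: `check_training_columns` and the MLA profile column names are hypothetical, and only the operator name and the two dense column names come from the error message in the commit text.

```python
from typing import Iterable, List


def missing_columns(available: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required columns that are absent, in required order."""
    have = set(available)
    return [col for col in required if col not in have]


def check_training_columns(op_name: str, available: Iterable[str],
                           required: Iterable[str]) -> None:
    """Raise instead of silently training on an incomplete profile."""
    missing = missing_columns(available, required)
    if missing:
        raise ValueError(f"Missing columns for {op_name} training: {missing}")


# Dense requirement taken from the quoted error message.
dense_required = ["total_tokens", "kv_cache_size"]

# Hypothetical latent-MLA profile columns (assumed for illustration).
mla_profile_columns = ["batch_size", "kv_lora_rank", "qk_rope_head_dim"]

try:
    # Validating an MLA profile against the dense contract fails fast.
    check_training_columns("attn_kv_cache_save", mla_profile_columns, dense_required)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

A family-aware trainer avoids this path by selecting the column contract from the profile's attention family before validating, rather than always applying the dense contract.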
PR Description Draft: Preserve MLA attention migration across metrics boundaries
Title
Preserve MLA attention migration across metrics boundaries
Summary
This PR migrates the reviewed feature-attn-sim attention refactor onto a clean origin/main ancestry branch without using --allow-unrelated-histories. It keeps the first PR focused on Frontier attention measurement/simulation, op-level tracing, predictor/runtime metadata, profiling importers, and reproducible equivalence validation.
The migration intentionally excludes the vLLM bridge and the current task-memory documentation from the first PR. Those remain local-only / follow-up scope unless explicitly approved later.
Safe Migration Route
pr/feature-attn-sim-migration
origin/main
fe70f8f47e054a0dcaec28bb77a9f433380eb3cb
Preserve MLA attention migration across metrics boundaries
81 paths, 24645 insertions, 267 deletions
git merge-base --is-ancestor origin/main HEAD: exit 0
git diff --check origin/main..HEAD: exit 0
0 matches
What Changed
Attention families and metadata
Profiling and runtime evidence paths
tests/analysis/.frontier/profiling/attention/vllm_mla_profile_importer.py for MLA profiling import support inside Frontier scope.
Equivalence validation infrastructure
tests/e2e/attention_equivalence/:
profile_manifest.py
measurement_csv_equivalence.py
simulation_output_equivalence.py
run_equivalence_matrix.py
matrix_cases.yaml
Root-Cause Fix Highlight: MLA Trace Metadata and Ledger Completeness
A final blocker was found and fixed before this commit:
attn_mla_* operators that can be emitted by MetricsStore, but the quantization manager, op trace metadata, trace mapping, and stage-batch ledger were not all using the same attention-family boundary.
0.42 ms, total 0.42 ms, delta 0.0 ms.
Validation Evidence
Targeted TDD evidence for MLA metadata / ledger fix
4 failed, 27 passed in 1.79s
31 passed in 1.58s
65 passed in 3.29s
377 passed, 3 skipped, 2 warnings in 14.09s
Extended strict equivalence matrix
First-six strict simulation-equivalence cases passed with zero mismatches:
dense_coloc_offline_short
dense_coloc_online_highqps
dense_pdd_offline_long
dense_pdd_online_highqps
moe_coloc_offline_short
moe_coloc_online_highqps
Result summary:
PASS | 6 | 6 | 0 for every case
REFERENCE_BASELINE_FAILS | 8 | 6
Important caveat: this PR does not claim full 8-case EP=1/DP=1 equivalence PASS. The original EP=1/DP=1 source MoE PDD rows remain REFERENCE_BASELINE_FAILS for moe_pdd_offline_long and moe_pdd_online_highqps.
MoE PDD source-compatible strict equivalence closure
A source-compatible DP=2/EP=2 MoE PDD configuration passed strict source-vs-migration comparison:
2/282/82153 | 306.982576876489 / 306.982576876489 | 133.878460025234 / 133.878460025234 | 86.089884095705 / 86.089884095705
4/482/82307 | 1105.036249009642 / 1105.036249009642 | 380.290074694862 / 380.290074694862 | 172.905575260109 / 172.905575260109
Result summary:
status=PASS
case_count=2
pass_count=2
mismatch_count=0
Live worker and deterministic artifact evidence
NVIDIA H800 and produced both attention.csv and attention_kernel_only.csv under task-local artifacts.
attention.csv, H800 attention_kernel_only.csv, dense offline simulator output, and MoE offline simulator output.
Independent Review Status
COMMENT; original blocker CLOSED; no CRITICAL/HIGH/MEDIUM issues.
WATCH; original architecture blocker CLOSED.
WATCH caveat to preserve: tracing and ledger paths are strict on family-complete structured timings. AttentionTime.total_time() remains a legacy compatibility path and is not itself a family-completeness checker.
Risks and Follow-ups
REFERENCE_BASELINE_FAILS; do not describe them as an 8-case strict PASS.
=10.1 remains intentionally uncommitted and undeleted.
Commands Used for Final Local Gates
cd /data/ycfeng/Frontier/worktrees/feature-attn-sim-migration
git status --short --branch
git diff --check origin/main..HEAD
git merge-base --is-ancestor origin/main HEAD
git diff --name-only origin/main..HEAD
Latest confirmed local results:
pr/feature-attn-sim-migration...origin/main [ahead 1]
=10.1
git diff --check origin/main..HEAD: exit 0
git merge-base --is-ancestor origin/main HEAD: exit 0
81
0
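For repeatability, the two exit-code gates listed above could be wrapped in a small script. The following is a sketch, not part of this PR: `GATES`, `run_git`, `run_gates`, and `all_gates_pass` are hypothetical names, and the runner is injectable so the logic can be exercised without a repository.

```python
import subprocess
from typing import Callable, Dict, List, Sequence

# The two gates whose exit codes are reported above.
GATES: List[List[str]] = [
    ["git", "diff", "--check", "origin/main..HEAD"],
    ["git", "merge-base", "--is-ancestor", "origin/main", "HEAD"],
]


def run_git(cmd: Sequence[str], cwd: str = ".") -> int:
    """Run one command and return its exit code without raising."""
    return subprocess.run(list(cmd), cwd=cwd, check=False).returncode


def run_gates(runner: Callable[[Sequence[str]], int] = run_git) -> Dict[str, int]:
    """Map each gate's command line to its exit code."""
    return {" ".join(cmd): runner(cmd) for cmd in GATES}


def all_gates_pass(results: Dict[str, int]) -> bool:
    """A gate passes only when its command exits with status 0."""
    return all(code == 0 for code in results.values())
```

Calling `all_gates_pass(run_gates())` from inside the worktree would report whether both gates exit 0; `git merge-base --is-ancestor` exits 1 when origin/main is not an ancestor of HEAD, so a diverged branch fails the check.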