
Preserve MLA attention migration across metrics boundaries#3

Open
fwyc0573 wants to merge 3 commits into main from pr/feature-attn-sim-migration

@fwyc0573 fwyc0573 commented Jun 29, 2026

Collaborator

PR Description Draft: Preserve MLA attention migration across metrics boundaries

Title

Preserve MLA attention migration across metrics boundaries

Summary

This PR migrates the reviewed feature-attn-sim attention refactor onto a clean origin/main ancestry branch without using --allow-unrelated-histories. It keeps the first PR focused on Frontier attention measurement/simulation, op-level tracing, predictor/runtime metadata, profiling importers, and reproducible equivalence validation.

The migration intentionally excludes the vLLM bridge and the current task-memory documentation from the first PR. Those remain local-only / follow-up scope unless explicitly approved later.

Safe Migration Route

  • PR branch: pr/feature-attn-sim-migration
  • Base branch: origin/main
  • Local commit: fe70f8f47e054a0dcaec28bb77a9f433380eb3cb
  • Commit subject: Preserve MLA attention migration across metrics boundaries
  • Commit scope: 81 paths, 24645 insertions, 267 deletions
  • Ancestry gate: git merge-base --is-ancestor origin/main HEAD exit 0
  • Whitespace gate: git diff --check origin/main..HEAD exit 0
  • Forbidden-path scan: 0 matches

What Changed

Attention families and metadata

  • Adds explicit attention-family abstractions for MHA, MQA, and latent MLA attention behavior.
  • Centralizes attention op naming, trace mapping, memory layout, profiling mapping, and model binding boundaries.
  • Extends quantization/op metadata handling so MLA trace ops are first-class instead of being accidentally treated as dense attention fallbacks.

Profiling and runtime evidence paths

  • Adds MLA / MHA / MQA analysis and live-probe helpers under tests/analysis/.
  • Adds frontier/profiling/attention/vllm_mla_profile_importer.py for MLA profiling import support inside Frontier scope.
  • Adds online per-op comparison utilities for chunked-prefill online traces.

Equivalence validation infrastructure

  • Adds deterministic artifact comparison tools under tests/e2e/attention_equivalence/:
    • profile_manifest.py
    • measurement_csv_equivalence.py
    • simulation_output_equivalence.py
    • run_equivalence_matrix.py
    • matrix_cases.yaml
  • Adds unit coverage for attention families, tracing, profiling contracts, predictor semantics, equivalence harness behavior, and MLA runtime operator timing.

Root-Cause Fix Highlight: MLA Trace Metadata and Ledger Completeness

A final blocker was found and fixed before this commit:

  • Root cause: MLA attention introduced attn_mla_* operators that can be emitted by MetricsStore, but the quantization manager, op trace metadata, trace mapping, and stage-batch ledger were not all using the same attention-family boundary.
  • Fix: derive enabled attention ops from registered attention families, explicitly support MLA op metadata, enforce family-scoped structured timing completeness, and use the latent MLA attention family for ledger component mapping.
  • Structured ledger numeric check: component sum 0.42 ms, total 0.42 ms, delta 0.0 ms.
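The family-scoped completeness rule and the ledger sum check described above can be sketched roughly as follows. This is a minimal illustration, not Frontier's actual code: `AttentionFamily`, `FAMILIES`, `enabled_attention_ops`, `check_ledger`, and every op name in the registry are hypothetical, and the 0.30 ms / 0.12 ms split is an assumed breakdown of the reported 0.42 ms total.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AttentionFamily:
    """Hypothetical stand-in for a registered attention family."""
    name: str
    op_names: Tuple[str, ...]


# Illustrative registry; the op names are placeholders, not Frontier's real ones.
FAMILIES = {
    "dense": AttentionFamily("dense", ("attn_prefill", "attn_decode")),
    "latent_mla": AttentionFamily("latent_mla", ("attn_mla_prefill", "attn_mla_decode")),
}


def enabled_attention_ops() -> Tuple[str, ...]:
    """Derive the enabled op list from the registry instead of a hard-coded list."""
    return tuple(op for family in FAMILIES.values() for op in family.op_names)


def check_ledger(family: str, timings_ms: Dict[str, float], total_ms: float,
                 tol_ms: float = 1e-9) -> float:
    """Return total minus component sum; raise if the family's op set is incomplete."""
    expected = set(FAMILIES[family].op_names)
    missing = expected - set(timings_ms)
    unexpected = set(timings_ms) - expected
    if missing or unexpected:
        # No silent fallback to another family's ops: incompleteness is an error.
        raise ValueError(
            f"{family}: missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )
    delta = total_ms - sum(timings_ms.values())
    if abs(delta) > tol_ms:
        raise ValueError(f"{family}: ledger delta {delta} ms exceeds {tol_ms} ms")
    return delta


# Assumed component split that adds up to the reported 0.42 ms total.
delta_ms = check_ledger(
    "latent_mla", {"attn_mla_prefill": 0.30, "attn_mla_decode": 0.12}, total_ms=0.42
)
```

Raising on a missing or unexpected op, rather than mapping it to a dense op, matches the rejected "fallback dense-op mapping" alternative recorded in the commit message.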

Validation Evidence

Targeted TDD evidence for MLA metadata / ledger fix

| Gate | Result |
| --- | --- |
| RED before fix | 4 failed, 27 passed in 1.79s |
| Targeted GREEN | 31 passed in 1.58s |
| Expanded targeted GREEN | 65 passed in 3.29s |
| Full migrated attention regression | 377 passed, 3 skipped, 2 warnings in 14.09s |

Extended strict equivalence matrix

The first six strict simulation-equivalence cases passed with zero mismatches:

  1. dense_coloc_offline_short
  2. dense_coloc_online_highqps
  3. dense_pdd_offline_long
  4. dense_pdd_online_highqps
  5. moe_coloc_offline_short
  6. moe_coloc_online_highqps

Result summary:

| Matrix Slice | Status | Case Count | Pass Count | Mismatches |
| --- | --- | --- | --- | --- |
| First-six strict matrix | PASS | 6 | 6 | 0 for every case |
| Original all-eight EP=1/DP=1 matrix | REFERENCE_BASELINE_FAILS | 8 | 6 | Source MoE PDD baseline rows fail before strict candidate comparison |

Important caveat: this PR does not claim full 8-case EP=1/DP=1 equivalence PASS. The original EP=1/DP=1 source MoE PDD rows remain REFERENCE_BASELINE_FAILS for moe_pdd_offline_long and moe_pdd_online_highqps.

MoE PDD source-compatible strict equivalence closure

A source-compatible DP=2/EP=2 MoE PDD configuration passed strict source-vs-migration comparison:

| Case | Rows | Columns | Numeric Comparisons | request_e2e_time ref/cand | ttft ref/cand | tpot ref/cand |
| --- | --- | --- | --- | --- | --- | --- |
| offline DP=2/EP=2 | 2/2 | 82/82 | 153 | 306.982576876489 / 306.982576876489 | 133.878460025234 / 133.878460025234 | 86.089884095705 / 86.089884095705 |
| online DP=2/EP=2 | 4/4 | 82/82 | 307 | 1105.036249009642 / 1105.036249009642 | 380.290074694862 / 380.290074694862 | 172.905575260109 / 172.905575260109 |

Result summary:

  • status=PASS
  • case_count=2
  • pass_count=2
  • mismatch_count=0
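For readers unfamiliar with what a strict comparison means here, the following is a minimal sketch of a zero-tolerance CSV comparison that reports the same kind of counters as the summary above. It is not the code in simulation_output_equivalence.py; `EquivalenceResult` and `compare_csv_strict` are hypothetical names, and the sketch assumes both files share the same header and row order.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class EquivalenceResult:
    rows: int
    columns: int
    numeric_comparisons: int
    mismatch_count: int

    @property
    def status(self) -> str:
        return "PASS" if self.mismatch_count == 0 else "FAIL"


def _as_float(cell: str):
    """Parse a cell as float, or return None for non-numeric cells."""
    try:
        return float(cell)
    except ValueError:
        return None


def _read(text: str):
    rows = list(csv.reader(io.StringIO(text)))
    return rows[0], rows[1:]


def compare_csv_strict(ref_text: str, cand_text: str) -> EquivalenceResult:
    """Compare reference and candidate CSV text cell by cell with zero tolerance."""
    ref_header, ref_rows = _read(ref_text)
    cand_header, cand_rows = _read(cand_text)
    if ref_header != cand_header or len(ref_rows) != len(cand_rows):
        raise ValueError("shape mismatch: headers or row counts differ")
    numeric = mismatches = 0
    for ref_row, cand_row in zip(ref_rows, cand_rows):
        for ref_cell, cand_cell in zip(ref_row, cand_row):
            ref_val, cand_val = _as_float(ref_cell), _as_float(cand_cell)
            if ref_val is not None and cand_val is not None:
                numeric += 1
                mismatches += ref_val != cand_val  # exact equality, no epsilon
            else:
                mismatches += ref_cell != cand_cell  # non-numeric cells compare as text
    return EquivalenceResult(len(ref_rows), len(ref_header), numeric, mismatches)


# One-row example using the offline DP=2/EP=2 values from the table.
header = "request_e2e_time,ttft,tpot\n"
row = "306.982576876489,133.878460025234,86.089884095705\n"
result = compare_csv_strict(header + row, header + row)
```

Exact equality is deliberate in this sketch: a migration that is meant to preserve behavior should reproduce the simulator output bit for bit, so any tolerance would hide drift.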

Live worker and deterministic artifact evidence

  • Bounded H800 worker smoke passed on NVIDIA H800 and produced both attention.csv and attention_kernel_only.csv under task-local artifacts.
  • Real deterministic artifact equivalence passed for four explicit cases: H800 attention.csv, H800 attention_kernel_only.csv, dense offline simulator output, and MoE offline simulator output.

Independent Review Status

  • Code-reviewer re-review: COMMENT; original blocker CLOSED; no CRITICAL/HIGH/MEDIUM issues.
  • Architecture re-review: WATCH; original architecture blocker CLOSED.

WATCH caveat to preserve: tracing and ledger paths are strict on family-complete structured timings. AttentionTime.total_time() remains a legacy compatibility path and is not itself a family-completeness checker.

Risks and Follow-ups

  1. Remote CI will not run until this PR is opened.
  2. The original EP=1/DP=1 source MoE PDD rows remain REFERENCE_BASELINE_FAILS; do not describe them as an 8-case strict PASS.
  3. vLLM bridge work is intentionally excluded from this PR and should be handled as follow-up scope.
  4. Current task-memory documentation remains local-only and is not part of the PR diff.
  5. The local untracked artifact =10.1 remains intentionally uncommitted and undeleted.

Commands Used for Final Local Gates

cd /data/ycfeng/Frontier/worktrees/feature-attn-sim-migration
git status --short --branch
git diff --check origin/main..HEAD
git merge-base --is-ancestor origin/main HEAD
git diff --name-only origin/main..HEAD

Latest confirmed local results:

  • Branch state: pr/feature-attn-sim-migration...origin/main [ahead 1]
  • Remaining untracked local artifact: =10.1
  • git diff --check origin/main..HEAD: exit 0
  • git merge-base --is-ancestor origin/main HEAD: exit 0
  • Commit path count: 81
  • Forbidden path count: 0
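The local gates above could be wrapped in a small script so that they run the same way every time. The sketch below is one possible shape, not something in the PR: `FORBIDDEN_PATTERNS` is a hypothetical pattern list (the PR does not state how the forbidden-path scan is defined), and `forbidden_paths` and `run_gates` are illustrative names. Only the git commands themselves come from the description.

```python
import fnmatch
import subprocess
from typing import Callable, Dict, Iterable, List, Sequence

# Hypothetical patterns; the real forbidden-path list is not given in the PR.
FORBIDDEN_PATTERNS = ("*vllm_bridge*", "*task-memory*", "=10.1")

# The two exit-code gates quoted in the PR description.
GATES = (
    ("whitespace", ["git", "diff", "--check", "origin/main..HEAD"]),
    ("ancestry", ["git", "merge-base", "--is-ancestor", "origin/main", "HEAD"]),
)


def forbidden_paths(paths: Iterable[str],
                    patterns: Sequence[str] = FORBIDDEN_PATTERNS) -> List[str]:
    """Return the changed paths that match any forbidden glob pattern."""
    return [p for p in paths if any(fnmatch.fnmatch(p, pat) for pat in patterns)]


def _git(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a git command without raising, so the exit code can be inspected."""
    return subprocess.run(list(cmd), capture_output=True, text=True, check=False)


def run_gates(run: Callable = _git) -> Dict[str, bool]:
    """Run each gate and map its name to True when it passes."""
    results = {name: run(cmd).returncode == 0 for name, cmd in GATES}
    changed = run(["git", "diff", "--name-only", "origin/main..HEAD"]).stdout.splitlines()
    results["forbidden_paths"] = not forbidden_paths(changed)
    return results
```

Calling `run_gates()` inside the worktree returns a dict such as `{"whitespace": ..., "ancestry": ..., "forbidden_paths": ...}`; the `run` parameter exists so the gate logic can be exercised without a real repository.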

fwyc0573 added 3 commits June 29, 2026 15:48
Migrate attention-family modeling, MLA profiling/prediction support, equivalence harnesses, and regression coverage onto the normal origin/main ancestry while excluding the vLLM bridge and local task-memory artifacts.

Constraint: First Frontier PR excludes vLLM bridge, task-memory artifacts, runtime state, and local =10.1 artifact.

Rejected: Direct --allow-unrelated-histories merge | it would mix unrelated roots and create an unreviewable broad replacement diff.

Rejected: Fallback dense-op mapping for attn_mla_* | it would hide missing metadata and ledger contracts.

Confidence: high

Scope-risk: moderate

Directive: Keep attention-family consumers on shared family mappers when structured operator timings exist; keep vLLM bridge work in a separate follow-up PR.

Tested: Expanded targeted MLA/family suite 65 passed in 3.29s; full migrated attention regression 377 passed, 3 skipped, 2 warnings in 14.09s; staged diff check exit 0; staged forbidden paths 0.

Not-tested: Remote CI, branch push, and PR creation are not run pending explicit user approval.

The analytical KV-cache transfer predictor and the op-trace transfer
metadata both sized the cross-cluster KV payload with the dense formula
num_kv_heads x head_dim x 2. For latent MLA this silently injected the
dense head layout (~49152 elems/token) instead of the latent layout
(kv_lora_rank + qk_rope_head_dim = 576, kv_factor = 1), polluting
transfer latency and TTFT under pd-disaggregation with no error raised.

Resolve the runtime KV layout through bind_attention_family +
get_attention_runtime_kv_layout at both sites and replace the literal
x2 with layout.kv_factor. Dense stays byte-identical (kv_factor=2,
runtime heads = num_kv_heads, runtime head size = head_dim); MLA now
uses (1, 576, 1). Existing override_num_heads / override_head_dim
precedence is preserved (family still owns kv_factor).
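The size difference between the two layouts can be checked with a few lines of arithmetic. This is a sketch only: `RuntimeKVLayout`, `dense_layout`, and `latent_mla_layout` are hypothetical names rather than the objects returned by get_attention_runtime_kv_layout, the dense values (128 heads, head_dim 192) are assumed because they reproduce the ~49152 figure, and the 512 + 64 split of 576 is likewise an assumption. Layer count and dtype width are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeKVLayout:
    """Hypothetical stand-in for the family-resolved runtime KV layout."""
    kv_factor: int   # 2 when K and V are stored separately, 1 for a shared latent
    num_heads: int   # runtime KV heads
    head_size: int   # runtime elements per head

    def elems_per_token(self) -> int:
        return self.kv_factor * self.num_heads * self.head_size


def dense_layout(num_kv_heads: int, head_dim: int) -> RuntimeKVLayout:
    # Dense formula from the commit message: num_kv_heads x head_dim x 2.
    return RuntimeKVLayout(kv_factor=2, num_heads=num_kv_heads, head_size=head_dim)


def latent_mla_layout(kv_lora_rank: int, qk_rope_head_dim: int) -> RuntimeKVLayout:
    # Latent layout: one shared vector of kv_lora_rank + qk_rope_head_dim elements.
    return RuntimeKVLayout(kv_factor=1, num_heads=1,
                           head_size=kv_lora_rank + qk_rope_head_dim)


# Assumed model dimensions, chosen to match the figures quoted above.
dense = dense_layout(num_kv_heads=128, head_dim=192)
mla = latent_mla_layout(kv_lora_rank=512, qk_rope_head_dim=64)

dense_elems = dense.elems_per_token()    # 2 * 128 * 192
mla_elems = mla.elems_per_token()        # 1 * 1 * 576
overestimate = dense_elems / mla_elems   # how far the dense formula overshoots
```

Under these assumed dimensions the dense formula overstates the per-token payload by roughly 85x, which is why the error showed up as inflated transfer latency and TTFT rather than as a crash.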

Tests: dense assertions unchanged; new latent-MLA cases lock transfer
size 18432 (batch) / 6912 (request) and trace meta [4,2,1,576,1].

The shared ExecutionTimePredictionModelManager (always built at
simulator.py:153 and used by the pd-disaggregation path) hard-coded
the dense attention family for attention-core training. Feeding it a
latent-MLA profile failed fast at shared_prediction_model_manager.py
with "Missing columns for attn_kv_cache_save training:
['total_tokens', 'kv_cache_size']", so the six attn_mla_* models were
never produced for the disagg path while the monolithic predictor
trained them fine.

Make attention-core training family-aware, mirroring the monolithic
reference (sklearn_execution_time_predictor.py): route MLA profiles
through a structural filter in _load_attention_df, normalize MLA
derived features, and train exactly the six attn_mla_* operators in
_train_attn_models_for_cluster before the dense block, attaching the
shared _build_exact_feature_lookup for on-demand prediction. This is
training-only: the on-demand consumer already dispatches family-aware
and works unchanged once the six estimators exist. Dense path is
untouched.

Tests: new eager-attention MLA suite locks the six trained ops, their
feature/target columns, the sparse-by-target row counts
(kv_cache_save=3, others=2), the exact-lookup attachment, and the
structural-filter row selection.
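
The structural filter and the sparse-by-target row selection mentioned in this commit message can be illustrated with plain dictionaries. Everything here is hypothetical: the column names, `detect_family`, and `rows_for_target` are not taken from shared_prediction_model_manager.py, and the toy profile is constructed only so that the row counts mirror the kv_cache_save=3 / others=2 pattern.

```python
from typing import Dict, Iterable, List, Optional, Sequence

Row = Dict[str, Optional[float]]

MLA_TARGET_PREFIX = "time_attn_mla_"  # hypothetical column prefix


def detect_family(columns: Iterable[str]) -> str:
    """Structural filter: classify a profile by its columns, not by a config flag."""
    if any(col.startswith(MLA_TARGET_PREFIX) for col in columns):
        return "latent_mla"
    return "dense"


def rows_for_target(rows: Sequence[Row], target: str,
                    features: Sequence[str]) -> List[Row]:
    """Sparse-by-target selection: keep rows where the target and all features exist."""
    return [
        row for row in rows
        if row.get(target) is not None
        and all(row.get(feature) is not None for feature in features)
    ]


# Toy MLA profile: the decode target is missing in one row, so it trains on fewer rows.
profile: List[Row] = [
    {"num_tokens": 128.0, "time_attn_mla_kv_cache_save": 0.011, "time_attn_mla_decode": 0.040},
    {"num_tokens": 256.0, "time_attn_mla_kv_cache_save": 0.019, "time_attn_mla_decode": 0.071},
    {"num_tokens": 512.0, "time_attn_mla_kv_cache_save": 0.036, "time_attn_mla_decode": None},
]

family = detect_family(profile[0].keys())
save_rows = rows_for_target(profile, "time_attn_mla_kv_cache_save", ["num_tokens"])
decode_rows = rows_for_target(profile, "time_attn_mla_decode", ["num_tokens"])
```

Selecting rows per target keeps one operator's missing measurements from shrinking the training set of every other operator.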