10x inference: bypass hull for passthrough and position-keyed heads #2
trulite wants to merge 2 commits into Percepta-Core:main
…ries

After sequential generation completes, re-runs the forward pass with two optimizations:

1. Batch all token projections into dgemm (matrix-matrix) instead of per-token dgemv (matrix-vector). Accelerate/OpenBLAS automatically uses all CPU cores + AMX.
2. Parallelize hull insert+query across attention heads with OpenMP. Each head's hull is independent, so no synchronization is needed. Loop nesting is flipped (outer=heads, inner=positions) so there's one OpenMP fork per layer instead of one per position.

Results on M4 Pro (Sudoku, 980K tokens):

  Sequential: 37.7s (26K tok/s)
  Batched: 6.0s (165K tok/s), 6.3x faster

The build system detects libomp on macOS (homebrew) and links OpenBLAS + libgomp on Linux. Falls back gracefully when neither is available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
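The loop flip in item 2 can be illustrated with a small sketch. It is not the engine's code: a brute-force causal argmax stands in for the hull, Python threads stand in for OpenMP (so it shows independence, not speed), and every name in it (`causal_argmax`, `positions_outer`, `heads_outer`, `demo_keys`, `demo_queries`) is hypothetical. The dgemm batching from item 1 is not shown.

```python
from concurrent.futures import ThreadPoolExecutor

def causal_argmax(keys, queries):
    """Stand-in for one head's hull (brute force, not a real hull):
    at step t, return the index s <= t maximizing dot(query_t, key_s)."""
    out = []
    for t, (qx, qy) in enumerate(queries):
        out.append(max(range(t + 1),
                       key=lambda s: qx * keys[s][0] + qy * keys[s][1]))
    return out

def positions_outer(keys_per_head, queries_per_head):
    """Original nesting: outer loop over positions, inner loop over heads."""
    n_heads, n_pos = len(keys_per_head), len(keys_per_head[0])
    out = [[None] * n_pos for _ in range(n_heads)]
    for t in range(n_pos):
        for h in range(n_heads):
            qx, qy = queries_per_head[h][t]
            ks = keys_per_head[h]
            out[h][t] = max(range(t + 1),
                            key=lambda s: qx * ks[s][0] + qy * ks[s][1])
    return out

def heads_outer(keys_per_head, queries_per_head, workers=4):
    """Flipped nesting: one parallel region over heads, each head walking
    all positions on its own. Heads share no state, so no locking is needed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(causal_argmax, keys_per_head, queries_per_head))

# Hypothetical toy data: 4 heads, 32 positions, parabolic keys (2s, -s^2),
# integer queries q <= t encoded as (q, 1).
N_HEADS, N_POS = 4, 32
demo_keys = [[(2 * s, -s * s) for s in range(N_POS)] for _ in range(N_HEADS)]
demo_queries = [[((t * (h + 1)) % (t + 1), 1) for t in range(N_POS)]
                for h in range(N_HEADS)]
```

Both orderings produce the same per-head results because each head only ever reads its own keys and queries; that independence is what allows a single fork per layer.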
The weight builder now classifies each attention head:

  0 = lookup (needs hull for key-value search)
  1 = passthrough (output = V[t], proven: score(t,t) > score(t,s))
  2 = gather (position-keyed lookup, output = V[round(qx/qy)])

The head_type metadata is saved to model.bin and read by the C++ engine. In the batched verification, passthrough heads copy V[t] and gather heads compute a single array index. Both are O(1) per position, with no hull insert/query needed.

Mathematical proof (gather): for keys on the parabola (2s, -s²), the score -(s-q)² + q² is uniquely maximized at s=q. The quadratic penalty (≥1 for integer keys) dominates the tiebreak perturbation (<0.62). The query value q is always an integer (sum of integer cumsums) and q ≤ t (causality by construction).

Results on M4 Pro (Sudoku, 980K tokens):

  PR Percepta-Core#1 (hull only): 5.96s (6.3× over sequential)
  + passthrough bypass: 5.50s (6.6×)
  + gather bypass: 3.96s (9.6×)

17 passthrough + 15 gather = 32 heads skip the hull entirely. The remaining 9 active lookup heads have small key sets (K ≤ 200) and run the hull in microseconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
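As a rough check of the gather argument, the sketch below compares a brute-force causal argmax over parabolic keys with the O(1) index round(qx/qy). It assumes the query is encoded as (qx, qy) = (q·c, c) with c > 0, and it models the tiebreak as a per-key additive term in [0, 0.62); both are assumptions made for illustration, not details confirmed by the PR. The function names are hypothetical; only the head_type codes 0/1/2 come from the commit message.

```python
LOOKUP, PASSTHROUGH, GATHER = 0, 1, 2  # head_type codes from the commit message

def parabola_key(s):
    """Key for position s on the parabola (2s, -s^2)."""
    return (2 * s, -s * s)

def lookup_index(qx, qy, t, tiebreak=lambda s: 0.0):
    """Reference path: brute-force argmax over causal positions s <= t.
    With (qx, qy) = (q, 1) the score is 2sq - s^2 = -(s - q)^2 + q^2.
    `tiebreak` is an assumed additive perturbation, not the engine's."""
    def score(s):
        kx, ky = parabola_key(s)
        return qx * kx + qy * ky + tiebreak(s)
    return max(range(t + 1), key=score)

def gather_index(qx, qy):
    """Bypass path: recover q directly, no search."""
    return round(qx / qy)

def head_output(head_type, qx, qy, values, t):
    """Dispatch on head type; values[s] plays the role of V[s]."""
    if head_type == PASSTHROUGH:
        return values[t]                      # O(1): copy V[t]
    if head_type == GATHER:
        return values[gather_index(qx, qy)]   # O(1): single array index
    return values[lookup_index(qx, qy, t)]    # search (a hull in the engine)
```

For integer q ≤ t, moving one step away from s = q costs at least 1 in score, so a perturbation below 0.62 under this model cannot change the winner; the two paths then agree.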
Hey @trulite, wanted to close the loop on this. Both this PR and #1 cherry-picked cleanly into a downstream fork at https://github.com/oaustegard/transformer-vm with attribution preserved on the original commits, and I ran the engine matrix on Linux + OpenBLAS that you'd flagged as untested in the test plans.

Linux validation passes: token output is byte-identical to the sequential loop on

Speedup numbers came in qualitatively similar but smaller than your M4 Pro Sudoku run: 3.0× from PR #1 alone and 1.4× incremental from PR #2's head bypass on top. Plausibly explained by OpenBLAS-on-x86 vs Accelerate+AMX, smaller model (D=38, L=7), and shorter sequences. The PR #2 head-bypass uplift breaks down to +10–40% across programs (collatz benefits most at +41%; mcm least at +17%).

One observation worth flagging: on the engine-matrix sweep, naive batched is a wash (1.0–1.3×), which isolates the speedup source: it's the dgemm batching that does the work, with parallel-hull as a smaller secondary multiplier. PR #1's framing bundles them together as "6.3×"; on this hardware/model the dgemm-batched-only win is most of it.

Full writeup: https://github.com/oaustegard/transformer-vm/blob/main/results/issue18_investigation.md

Thanks for the work. These were both clean PRs to integrate.
Summary
Mathematical basis
Passthrough heads use the erase Q/K pattern:
score(t,s) = HARD_K·√2·(2ts + 1), maximized at s=t over causal positions s ≤ t. Proven by the weight builder's separation of passthrough and lookup heads (lines 486-508 of weights.py).

Gather heads key on position via the parabolic encoding:
score = -(s-q)² + q². The quadratic penalty (≥1 for integers) dominates the tiebreak perturbation (<0.62). The gather index round(qx/qy) recovers the 1D query from the 2D attention scores.

Results (M4 Pro, Sudoku 980K tokens)
32 of 42 active heads bypass the hull. The remaining 9 lookup heads have small key sets (K ≤ 200) and run the hull in microseconds.
Head classification (this build)
Test plan
🤖 Generated with Claude Code