
10x inference: bypass hull for passthrough and position-keyed heads#2

Open
trulite wants to merge 2 commits into Percepta-Core:main from trulite:head-type-bypass

Conversation

@trulite

@trulite trulite commented Mar 27, 2026

Summary

  • Classify attention heads at weight-build time: passthrough (output = V[t]), gather (position-keyed lookup = V[round(q)]), or hull (needs search)
  • Save head_type metadata to model.bin alongside head_tiebreak
  • In batched verification, passthrough and gather heads skip the hull entirely — O(1) per position instead of O(log n)
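The per-position dispatch described above can be sketched as follows (a minimal Python sketch with hypothetical names; the real engine is C++ and reads `head_type` from model.bin):

```python
# Head-type codes as stored in model.bin (per the PR description).
LOOKUP, PASSTHROUGH, GATHER = 0, 1, 2

def head_output(head_type, V, t, q=None, hull_query=None):
    """Return one head's output at position t.

    V: per-position value vectors for this head; q: scalar query for
    gather heads (qx/qy in the PR). All names here are illustrative,
    not the engine's actual API.
    """
    if head_type == PASSTHROUGH:
        return V[t]                # output = V[t], no search at all
    if head_type == GATHER:
        s = round(q)               # position-keyed lookup, O(1)
        return V[min(s, t)]        # q <= t by construction (causality)
    return hull_query(t)           # lookup heads still need the hull
```

Only the remaining lookup heads pay for hull insert+query, which is where the O(1)-vs-O(log n) claim comes from.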

Mathematical basis

Passthrough heads use the erase Q/K pattern: score(t,s) = HARD_K·√2·(2ts + 1). For fixed t this score is increasing in s, and causality restricts s ≤ t, so it is maximized at s = t. This follows from the weight builder's separation of passthrough and lookup heads (lines 486–508 of weights.py).
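A quick numeric check of this claim (HARD_K here is a placeholder value; the actual constant is defined in weights.py):

```python
import math

HARD_K = 100.0  # placeholder; the real constant lives in weights.py

def passthrough_score(t, s):
    # score(t, s) = HARD_K * sqrt(2) * (2*t*s + 1), from the PR text
    return HARD_K * math.sqrt(2.0) * (2 * t * s + 1)

t = 7
scores = {s: passthrough_score(t, s) for s in range(t + 1)}  # causal: s <= t
assert max(scores, key=scores.get) == t  # argmax is s = t
```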

Gather heads key on position via the parabolic encoding: score = -(s-q)² + q², uniquely maximized at s = q. The quadratic penalty (≥ 1 for integer s ≠ q) dominates the tiebreak perturbation (< 0.62). The gather index round(qx/qy) recovers the 1D query from the 2D attention scores.
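The gather argument can be checked numerically (a sketch following the PR's notation): the parabolic score is uniquely maximized at s = q, and for integer s ≠ q the penalty is at least 1, which exceeds the 0.62 tiebreak bound.

```python
TIEBREAK_BOUND = 0.62  # stated bound on the tiebreak perturbation

def gather_score(s, q):
    # score = -(s - q)^2 + q^2, for keys on the parabola (2s, -s^2)
    return -(s - q) ** 2 + q * q

q = 5  # integer query (a sum of integer cumsums, per the PR)
best = max(range(q + 1), key=lambda s: gather_score(s, q))
runner_up = max(gather_score(s, q) for s in range(q + 1) if s != q)
assert best == q
assert gather_score(q, q) - runner_up >= 1 > TIEBREAK_BOUND
```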

Results (M4 Pro, Sudoku 980K tokens)

| | Time | Speedup | Hull time |
| --- | --- | --- | --- |
| Sequential baseline | 37.5s | | |
| PR #1 (dgemm + parallel hull) | 5.96s | 6.3× | 4.6s |
| This PR | 3.96s | 9.6× | 2.8s |

32 of 41 active heads bypass the hull. The remaining 9 lookup heads have small key sets (K ≤ 200) and run the hull in microseconds.

Head classification (this build)

hull=9 active  (stack_depth, memory, local, call_stack, byte_index, cumsum)
passthrough=17 (erase/slot copy)
gather=15      (instruction fetch, store_bytes, top/second/third_byte, const_byte)
unused=92      (layers 5-6 have no lookups)

Test plan

  • hello, addition, collatz, fibonacci, min_cost_matching, sudoku — all PASS
  • Spot check OK on Sudoku (980K tokens)
  • Test on Linux with OpenBLAS

🤖 Generated with Claude Code

trulite and others added 2 commits March 27, 2026 03:11
…ries

After sequential generation completes, re-runs the forward pass with two
optimizations:

1. Batch all token projections into dgemm (matrix-matrix) instead of
   per-token dgemv (matrix-vector). Accelerate/OpenBLAS automatically
   uses all CPU cores + AMX.

2. Parallelize hull insert+query across attention heads with OpenMP.
   Each head's hull is independent — no synchronization needed. Loop
   nesting is flipped (outer=heads, inner=positions) so there's one
   OpenMP fork per layer instead of one per position.
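A minimal numpy illustration of point 1 (shapes and values here are made up; the engine calls BLAS dgemm directly from C++): stacking per-token matrix-vector products into one matrix-matrix product hands BLAS the whole batch at once.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))    # projection weight, D_out x D_in
X = rng.standard_normal((1000, 32))  # activations for 1000 tokens

# per-token dgemv: 1000 separate matrix-vector products
per_token = np.stack([W @ x for x in X])

# batched dgemm: one matrix-matrix product over the whole sequence
batched = X @ W.T

assert np.allclose(per_token, batched)  # same result, one BLAS call
```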

Results on M4 Pro (Sudoku, 980K tokens):
  Sequential:  37.7s  (26K tok/s)
  Batched:      6.0s  (165K tok/s)  — 6.3x faster

The build system detects libomp on macOS (homebrew) and links OpenBLAS +
libgomp on Linux. Falls back gracefully when neither is available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The weight builder now classifies each attention head:
  0 = lookup (needs hull for key-value search)
  1 = passthrough (output = V[t], proven: score(t,t) > score(t,s))
  2 = gather (position-keyed lookup, output = V[round(qx/qy)])

The head_type metadata is saved to model.bin and read by the C++
engine. In the batched verification, passthrough heads copy V[t]
and gather heads compute a single array index — both O(1) per
position, no hull insert/query needed.

Mathematical proof (gather): for keys on the parabola (2s, -s²),
the score -(s-q)² + q² is uniquely maximized at s=q. The quadratic
penalty (≥1 for integer keys) dominates the tiebreak perturbation
(<0.62). The query value q is always an integer (sum of integer
cumsums) and q ≤ t (causality by construction).

Results on M4 Pro (Sudoku, 980K tokens):
  PR Percepta-Core#1 (hull only):     5.96s  (6.3× over sequential)
  + passthrough bypass:  5.50s  (6.6×)
  + gather bypass:       3.96s  (9.6×)

17 passthrough + 15 gather = 32 heads skip the hull entirely.
The remaining 9 active lookup heads have small key sets (K ≤ 200)
and run the hull in microseconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ryvn-technologies

Ryvn Preview

Creating preview prerelease-Percepta-Core-transformer-vm for this pull request.


This comment will be automatically updated with preview details.

@oaustegard

Hey @trulite — wanted to close the loop on this. Both this PR and #1 cherry-picked cleanly into a downstream fork at https://github.com/oaustegard/transformer-vm with attribution preserved on the original commits, and I ran the engine matrix on Linux + OpenBLAS that you'd flagged as untested in the test plans.

Linux validation passes: token output is byte-identical to the sequential loop on hello (1K), addition (4K), collatz (45K), fibonacci (9K), and min_cost_matching (178K). Built with g++ -fopenmp -lopenblas. The OMP #pragma does need -fopenmp explicitly on Linux — without it the parallel hull silently single-threads — so I added a Makefile target that wires that up.

Speedup numbers came in qualitatively but smaller than your M4 Pro Sudoku run: 3.0× from PR #1 alone and 1.4× incremental from PR #2's head bypass on top. Plausibly explained by OpenBLAS-on-x86 vs Accelerate+AMX, smaller model (D=38, L=7), and shorter sequences. The PR #2 head-bypass uplift breaks down to +10–40% across programs (collatz benefits most at +41%; mcm least at +17%).

One observation worth flagging: on the engine-matrix sweep, naive batched is a wash (1.0–1.3×), which isolates the speedup source — it's the dgemm batching that does the work, with parallel-hull as a smaller secondary multiplier. PR #1's framing bundles them together as "6.3×"; on this hardware/model the dgemm-batched-only win is most of it.

Full writeup: https://github.com/oaustegard/transformer-vm/blob/main/results/issue18_investigation.md

Thanks for the work — these were both clean PRs to integrate.
