10x inference: bypass hull for passthrough and position-keyed heads by trulite · Pull Request #2 · Percepta-Core/transformer-vm

trulite · 2026-03-27T11:49:08Z

Summary

Classify attention heads at weight-build time: passthrough (output = V[t]), gather (position-keyed lookup = V[round(q)]), or hull (needs search)
Save head_type metadata to model.bin alongside head_tiebreak
In batched verification, passthrough and gather heads skip the hull entirely — O(1) per position instead of O(log n)
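The three cases above can be sketched as a per-head dispatch. This is a minimal Python illustration, not the engine's C++ code: the function names, the brute-force scan standing in for the hull, and the 2D query layout `(qx, qy)` are assumptions made for the example. Only the numeric head-type codes are taken from the commit message in this PR.

```python
# Head-type codes as listed in this PR's commit message.
LOOKUP, PASSTHROUGH, GATHER = 0, 1, 2


def brute_force_lookup(query, keys, values, t):
    """Hypothetical stand-in for the hull: scan every causal key s <= t."""
    qx, qy = query
    best = max(range(t + 1), key=lambda s: qx * keys[s][0] + qy * keys[s][1])
    return values[best]


def head_output(head_type, query, keys, values, t):
    """Return one head's output at position t, bypassing the search when possible."""
    if head_type == PASSTHROUGH:
        # Output is V[t]; no search needed.
        return values[t]
    if head_type == GATHER:
        # Position-keyed lookup: a single array index, V[round(qx/qy)].
        qx, qy = query
        return values[round(qx / qy)]
    # Remaining heads still need a real search over the keys.
    return brute_force_lookup(query, keys, values, t)


# Toy data: keys on the parabola (2s, -s^2), one value per position.
keys = [(2 * s, -s * s) for s in range(6)]
values = ["v%d" % s for s in range(6)]

# For a position-keyed query the bypass and the full search agree.
query = (2.0 * 3, 2.0)  # encodes q = 3
assert head_output(GATHER, query, keys, values, 5) == head_output(LOOKUP, query, keys, values, 5)
```

The bypass branches do constant work per position, while the fallback here scans linearly; the real hull is described as O(log n), so the scan only stands in for its result, not its cost.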

Mathematical basis

Passthrough heads use the erase Q/K pattern: score(t,s) = HARD_K·√2·(2ts + 1), maximized at s=t. This follows from the weight builder's separation of passthrough and lookup heads (lines 486-508 of weights.py).
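The maximization claim can be checked with one subtraction. The step below is a sketch under assumptions not stated in the PR text: HARD_K is positive, positions satisfy t ≥ 1, and keys are restricted to the causal range 0 ≤ s ≤ t.

```latex
\begin{aligned}
\mathrm{score}(t,s) &= \mathrm{HARD\_K}\,\sqrt{2}\,(2ts + 1),\\
\mathrm{score}(t,t) - \mathrm{score}(t,s)
  &= \mathrm{HARD\_K}\,\sqrt{2}\,\cdot 2t\,(t - s) \;>\; 0
  \quad\text{for } 0 \le s < t,\ t \ge 1,\ \mathrm{HARD\_K} > 0.
\end{aligned}
```

Under those assumptions the score is strictly increasing in s, so the newest causal key s = t wins and the head can return V[t] directly.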

Gather heads key on position via the parabolic encoding: score = -(s-q)² + q². The quadratic penalty (≥1 for integers) dominates the tiebreak perturbation (<0.62). The gather index round(qx/qy) recovers the 1D query from the 2D attention scores.
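The parabolic encoding can be checked numerically. The sketch below assumes a query of the form (q, 1) and integer key positions; the helper names are hypothetical. It confirms that the dot product with the key (2s, -s²) equals -(s-q)² + q², that the best key is s = q, and that the runner-up trails by at least 1, which it then compares with the 0.62 bound quoted above.

```python
def parabola_key(s):
    """Key for position s on the parabola (2s, -s^2)."""
    return (2 * s, -s * s)


def score(q, s):
    """Dot product of the query (q, 1) with the key at position s."""
    kx, ky = parabola_key(s)
    return q * kx + 1 * ky


def margin(q, n_keys):
    """Gap between the best score and the runner-up over keys 0..n_keys-1."""
    ranked = sorted((score(q, s) for s in range(n_keys)), reverse=True)
    return ranked[0] - ranked[1]


TIEBREAK_BOUND = 0.62  # perturbation bound quoted in the PR description

N_KEYS = 8
for q in range(N_KEYS):
    for s in range(N_KEYS):
        # Completing the square: 2sq - s^2 = -(s - q)^2 + q^2.
        assert score(q, s) == -(s - q) ** 2 + q ** 2
    best = max(range(N_KEYS), key=lambda s: score(q, s))
    assert best == q
    assert margin(q, N_KEYS) >= 1 > TIEBREAK_BOUND
```

If the query is scaled to (qx, qy) with qy > 0, every score is multiplied by qy and the winner is unchanged, which is why dividing qx by qy and rounding gives back the index.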

Results (M4 Pro, Sudoku 980K tokens)

Configuration	Time	Speedup	Hull time
Sequential baseline	37.5s	1×	—
PR #1 (dgemm + parallel hull)	5.96s	6.3×	4.6s
This PR	3.96s	9.6×	2.8s

32 of 41 active heads bypass the hull (17 passthrough + 15 gather). The remaining 9 lookup heads have small key sets (K ≤ 200) and run the hull in microseconds.

Head classification (this build)

hull=9 active  (stack_depth, memory, local, call_stack, byte_index, cumsum)
passthrough=17 (erase/slot copy)
gather=15      (instruction fetch, store_bytes, top/second/third_byte, const_byte)
unused=92      (layers 5-6 have no lookups)

Test plan

hello, addition, collatz, fibonacci, min_cost_matching, sudoku — all PASS
Spot check OK on Sudoku (980K tokens)
Test on Linux with OpenBLAS (not yet run)

🤖 Generated with Claude Code

…ries

After sequential generation completes, re-runs the forward pass with two optimizations:

1. Batch all token projections into dgemm (matrix-matrix) instead of per-token dgemv (matrix-vector). Accelerate/OpenBLAS automatically uses all CPU cores + AMX.
2. Parallelize hull insert+query across attention heads with OpenMP. Each head's hull is independent, so no synchronization is needed. Loop nesting is flipped (outer=heads, inner=positions) so there's one OpenMP fork per layer instead of one per position.

Results on M4 Pro (Sudoku, 980K tokens):
Sequential: 37.7s (26K tok/s)
Batched: 6.0s (165K tok/s), 6.3x faster

The build system detects libomp on macOS (homebrew) and links OpenBLAS + libgomp on Linux. Falls back gracefully when neither is available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
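The loop flip in this commit message can be illustrated structurally. The sketch below is Python, not the engine's C++/OpenMP code, and swaps in `concurrent.futures.ThreadPoolExecutor` as a stand-in for the OpenMP fork. The running maximum is a hypothetical placeholder for the hull's insert+query state, and `toy_score` is invented for the example. It only shows that per-head state is independent, so both nestings give identical results; it does not reproduce the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_score(head, t):
    """Hypothetical per-head, per-position score (placeholder data)."""
    return (head * 31 + t * 17) % 101


def layer_positions_outer(n_heads, n_positions):
    """Original nesting: outer = positions, inner = heads."""
    best = [float("-inf")] * n_heads
    out = [[] for _ in range(n_heads)]
    for t in range(n_positions):
        for h in range(n_heads):
            best[h] = max(best[h], toy_score(h, t))  # "insert" at position t
            out[h].append(best[h])                   # "query" at position t
    return out


def run_head(head, n_positions):
    """All positions for one head; touches no other head's state."""
    best = float("-inf")
    out = []
    for t in range(n_positions):
        best = max(best, toy_score(head, t))
        out.append(best)
    return out


def layer_heads_outer(n_heads, n_positions):
    """Flipped nesting: outer = heads, so workers are forked once per layer."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda h: run_head(h, n_positions), range(n_heads)))


assert layer_positions_outer(4, 50) == layer_heads_outer(4, 50)
```

Because each worker owns one head for the whole sequence, no locking is needed, which matches the "no synchronization" point in the commit message.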

The weight builder now classifies each attention head:

0 = lookup (needs hull for key-value search)
1 = passthrough (output = V[t], proven: score(t,t) > score(t,s))
2 = gather (position-keyed lookup, output = V[round(qx/qy)])

The head_type metadata is saved to model.bin and read by the C++ engine. In the batched verification, passthrough heads copy V[t] and gather heads compute a single array index. Both are O(1) per position, with no hull insert/query needed.

Mathematical proof (gather): for keys on the parabola (2s, -s²), the score -(s-q)² + q² is uniquely maximized at s=q. The quadratic penalty (≥1 for integer keys) dominates the tiebreak perturbation (<0.62). The query value q is always an integer (sum of integer cumsums) and q ≤ t (causality by construction).

Results on M4 Pro (Sudoku, 980K tokens):
PR Percepta-Core#1 (hull only): 5.96s (6.3× over sequential)
+ passthrough bypass: 5.50s (6.6×)
+ gather bypass: 3.96s (9.6×)

17 passthrough + 15 gather = 32 heads skip the hull entirely. The remaining 9 active lookup heads have small key sets (K ≤ 200) and run the hull in microseconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ryvn-technologies · 2026-03-27T11:49:11Z

Ryvn Preview

Creating preview prerelease-Percepta-Core-transformer-vm for this pull request.

This comment will be automatically updated with preview details.

oaustegard · 2026-04-26T13:54:22Z

Hey @trulite — wanted to close the loop on this. Both this PR and #1 cherry-picked cleanly into a downstream fork at https://github.com/oaustegard/transformer-vm with attribution preserved on the original commits, and I ran the engine matrix on Linux + OpenBLAS that you'd flagged as untested in the test plans.

Linux validation passes: token output is byte-identical to the sequential loop on hello (1K), addition (4K), collatz (45K), fibonacci (9K), and min_cost_matching (178K). Built with g++ -fopenmp -lopenblas. The OMP #pragma does need -fopenmp explicitly on Linux — without it the parallel hull silently single-threads — so I added a Makefile target that wires that up.

Speedup numbers came in qualitatively similar but smaller than your M4 Pro Sudoku run: 3.0× from PR #1 alone and a further 1.4× from PR #2's head bypass on top. This is plausibly explained by OpenBLAS-on-x86 vs Accelerate+AMX, a smaller model (D=38, L=7), and shorter sequences. The PR #2 head-bypass uplift ranges from +17% to +41% across programs (mcm benefits least at +17%; collatz most at +41%).

One observation worth flagging: on the engine-matrix sweep, naive batched is a wash (1.0–1.3×), which isolates the speedup source — it's the dgemm batching that does the work, with parallel-hull as a smaller secondary multiplier. PR #1's framing bundles them together as "6.3×"; on this hardware/model the dgemm-batched-only win is most of it.

Full writeup: https://github.com/oaustegard/transformer-vm/blob/main/results/issue18_investigation.md

Thanks for the work — these were both clean PRs to integrate.

trulite and others added 2 commits March 27, 2026 03:11

trulite wants to merge 2 commits into Percepta-Core:main from trulite:head-type-bypass
