Commit d519ddf
Accelerate cblas: implemented but disabled — dequant cost too high
cblas_sgemv (AMX) is 36x faster than manual dot for FP32 512×2048.
BUT: IQ2→FP32 dequant per cache miss costs ~15ms, overwhelming
the 0.019ms cblas gain. With 256 experts/layer, miss rate is >90%.
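For reference, the "manual dot" baseline can be sketched as a plain row-major matrix-vector product. This is an illustrative sketch, not the code from this commit: the function name `matvec_f32` and the row-major layout are assumptions, and the commit gives only the shape 512×2048.

```c
#include <assert.h>
#include <stddef.h>

/* Reference y = A * x for a row-major FP32 matrix (hypothetical helper).
 * On macOS with Accelerate, the equivalent BLAS call would be:
 *   cblas_sgemv(CblasRowMajor, CblasNoTrans, rows, cols,
 *               1.0f, a, cols, x, 1, 0.0f, y, 1);
 */
void matvec_f32(const float *a, const float *x, float *y,
                size_t rows, size_t cols)
{
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t r = 0; r < rows; ++r) {
        const float *row = a + r * cols;   /* start of row r */
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += row[c] * x[c];          /* one dot product per output */
        }
        y[r] = acc;
    }
}
```

The scalar loop is kept deliberately simple. A NEON or fused-quantized variant would replace the inner loop but produce the same `y`.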
Key learning:
cblas alone: 12ms/token (83 tok/s), incredible
dequant+cblas: 1720ms/token (0.6 tok/s), dequant dominates
fused IQ2 dot: ~270ms/token (3.7 tok/s), no dequant needed
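Why dequant dominates can be sketched with a simple expected-cost model: every expert matvec pays the sgemv, and each cache miss also pays the dequant. The function names are hypothetical and the model ignores all other per-token work, so treat it as a rough illustration only.

```c
#include <assert.h>

/* Expected cost (ms) of one expert matvec with an FP32 weight cache:
 * a miss adds the IQ2->FP32 dequant on top of the sgemv itself. */
double expected_matvec_ms(double miss_rate, double dequant_ms, double sgemv_ms)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return miss_rate * dequant_ms + sgemv_ms;
}

/* Convert a per-token latency in milliseconds to tokens per second. */
double tokens_per_second(double ms_per_token)
{
    assert(ms_per_token > 0.0);
    return 1000.0 / ms_per_token;
}
```

With the figures quoted above (15 ms dequant, 0.019 ms sgemv, 90% miss rate), `expected_matvec_ms(0.9, 15.0, 0.019)` comes to about 13.5 ms per matvec, roughly 20x the ~0.68 ms implied for the manual FP32 dot (36 × 0.019 ms).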
cblas would win IF we could pre-dequant all experts (90 GB FP32).
The fundamental bottleneck is IQ2_XXS decode complexity.
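The memory trade-off behind the 90 GB figure can be sketched as follows. This assumes the ggml IQ2_XXS block layout (256 weights packed into 66 bytes, i.e. 2.0625 bits per weight) and reads "90 GB" as decimal gigabytes. The weight count is derived from that figure and is not stated in the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed IQ2_XXS layout: 256 weights per block, 66 bytes per block. */
enum { IQ2_XXS_BLOCK_WEIGHTS = 256, IQ2_XXS_BLOCK_BYTES = 66 };

/* Bytes needed to hold n_weights as FP32. */
uint64_t fp32_bytes(uint64_t n_weights)
{
    return n_weights * 4u;
}

/* Bytes needed to hold n_weights as IQ2_XXS blocks. */
uint64_t iq2_xxs_bytes(uint64_t n_weights)
{
    assert(n_weights % IQ2_XXS_BLOCK_WEIGHTS == 0);
    return n_weights / IQ2_XXS_BLOCK_WEIGHTS * IQ2_XXS_BLOCK_BYTES;
}
```

Under these assumptions, 90 GB of FP32 corresponds to 22.5 billion expert weights, which occupy about 5.8 GB as IQ2_XXS. Pre-dequantizing everything would therefore cost roughly 15.5x the memory.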
Added: Accelerate framework linkage, cblas cache infrastructure,
Metal matmul test (individual IQ2 kernel works, MoE dispatch hangs).
Stable: 3.7 tok/s (fused IQ2 NEON dot, 6 threads)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2c6ea08 commit d519ddf
5 files changed
Lines changed: 1014 additions & 2 deletions