README.md: 23 additions & 19 deletions
@@ -1,19 +1,20 @@
# TurboQuant.cpp
-**Pure C inference engine with [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV cache compression.**
+**Standalone C inference engine with [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV cache compression. Not a wrapper: built from scratch, zero dependencies.**
GGUF support: Q8_0 verified. K-quant (Q4_K, Q6_K) and IQ2 dequantization are implemented but not yet quality-verified — contributions welcome.
+MoE architecture (Qwen3.5-35B-A3B): loading and routing implemented, quality verification in progress.
---
@@ -134,15 +138,15 @@ Every NEON path verified against scalar reference (`test_neon_scalar`). A Q4 deq
147 ns per 128-dim vector (NEON-vectorized). 1-bit attention: 1.2 ns/key. Compared to matmul (~1ms/layer), negligible. See `bench/bench_kv_overhead.cpp`.
**Q: "Only small models?"**
-Qwen3.5-35B-A3B MoE runs on a 16GB Mac Air (RSS 4.7GB). GGUF direct loading supports Q2_K through Q6_K and IQ2 formats.
+GGUF Q8_0 loading is verified for Qwen3.5 0.8B. MoE architecture (35B-A3B) loads and routes correctly; K-quant/IQ2 dequantization quality is being stabilized. The engine and KV compression are architecture-independent, verified on models from 270M to 4B.
---
## Under the Hood
- **15,000+ lines of C**: zero external dependencies
-- **GGUF v3 direct loading**: use llama.cpp models without conversion
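
The README quotes 1.2 ns/key for 1-bit attention on 128-dim vectors. As a rough illustration of how a score that cheap is possible, the sketch below packs the sign of each component into two 64-bit words and compares a query with a key using XOR plus popcount. This is an assumption about the general technique, not the repository's actual kernel, and `pack_signs`, `popcount64`, `sign_score`, and `self_check` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIM 128            /* vector dimension quoted in the README */
#define WORDS (DIM / 64)   /* 128 sign bits fit in two 64-bit words */

/* Pack the sign of each component into one bit (1 = non-negative). */
static void pack_signs(const float *v, uint64_t out[WORDS]) {
    for (size_t w = 0; w < WORDS; w++) {
        uint64_t bits = 0;
        for (size_t b = 0; b < 64; b++) {
            if (v[w * 64 + b] >= 0.0f) {
                bits |= (uint64_t)1 << b;
            }
        }
        out[w] = bits;
    }
}

/* Portable population count (Kernighan's loop); real kernels would
 * likely use a hardware popcount instruction instead. */
static int popcount64(uint64_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Sign-agreement score in [-DIM, DIM]: matching signs minus mismatches. */
static int sign_score(const uint64_t q[WORDS], const uint64_t k[WORDS]) {
    int mismatches = 0;
    for (size_t w = 0; w < WORDS; w++) {
        mismatches += popcount64(q[w] ^ k[w]);
    }
    return DIM - 2 * mismatches;
}

/* Hypothetical self-check: identical vectors score DIM, and flipping
 * the sign of 32 components lowers the score by 64. */
static void self_check(void) {
    float a[DIM], b[DIM];
    uint64_t pa[WORDS], pb[WORDS];
    for (int i = 0; i < DIM; i++) {
        a[i] = 1.0f;
        b[i] = (i < 32) ? -1.0f : 1.0f;
    }
    pack_signs(a, pa);
    pack_signs(b, pb);
    assert(sign_score(pa, pa) == DIM);
    assert(sign_score(pa, pb) == DIM - 64);
}
```

Under these assumptions each key costs two XORs and two popcounts, so the per-key work is a handful of integer instructions rather than 128 floating-point multiply-adds.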