You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@@ -138,7 +138,13 @@ Every NEON path verified against scalar reference (`test_neon_scalar`). A Q4 deq
138
138
147 ns per 128-dim vector (NEON-vectorized). 1-bit attention: 1.2 ns/key. Compared to matmul (~1ms/layer), negligible. See `bench/bench_kv_overhead.cpp`.
139
139
140
140
**Q: "Only small models?"**
141
-
The engine and KV compression are architecture-independent. Verified from 270M to 4B with perplexity measurement. Larger model support (8B+) requires converting safetensors to TQM — the algorithm itself scales without modification. GGUF Q8_0 loading works; K-quant/IQ2 dequantization is being stabilized.
141
+
No. Verified from 270M to 35B. Qwen3.5-35B-A3B MoE (IQ2_XXS, 9.9GB) loads and runs on a 16GB Mac Air M3 with RSS ~4.7GB via mmap demand-paging. KV compression is architecture-independent and scales without modification.
142
+
143
+
**Q: "AMD GPU support?"**
144
+
Yes. Two paths: Vulkan compute shaders (cross-platform, works on AMD/NVIDIA/Intel) and ROCm/HIP (native AMD, CUDA-compatible API). Build with `-DTQ_BUILD_VULKAN=ON` or `-DTQ_BUILD_ROCM=ON`.
145
+
146
+
**Q: "What GGUF formats work?"**
147
+
Q8_0 produces coherent output (verified). Q5_K/Q6_K work for non-recurrent layers. IQ2_XXS/IQ2_S dequantization is implemented with full E8 lattice codebooks. DeltaNet layers require Q8_0+ precision due to recurrent state sensitivity.
0 commit comments