Skip to content

Commit 99823b4

Browse files
unamedkrclaude
andcommitted
README: honest audit — remove unverified 35B claims, clarify GGUF status
Critical corrections: - Removed 35B 18.6x/131K claim (IQ2_XXS dequant produces garbage) - GGUF: Q8_0 verified ✓, K-quant/IQ2 marked as WIP - MoE: loading works, quality verification in progress - "Standalone C engine, not a wrapper" — made explicit - 15,000+ lines (was understated at 10,000+) - Test count: 31 (fixed all inconsistencies) - Removed "Llama 8B in progress" (no code evidence) - EN/KO synchronized What IS verified: TQM: Gemma 4B (PPL +0.03%), Qwen 0.8B, Gemma 270M — all KV types GGUF Q8_0: Qwen 0.8B — 1-bit K + Q4 V works 31/31 tests, ASan clean Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent c0b96e1 commit 99823b4

2 files changed

Lines changed: 41 additions & 33 deletions

File tree

README.ko.md

Lines changed: 18 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,20 @@
11
# TurboQuant.cpp
22

3-
**[TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV 캐시 압축을 구현한 순수 C 추론 엔진.**
3+
**[TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV 캐시 압축을 구현한 독립형 C 추론 엔진. 래퍼가 아닌 자체 구축, 외부 의존성 없음.**
44

55
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
66
[![Tests](https://img.shields.io/badge/tests-31%20pass-brightgreen)]()
77
[![ASan](https://img.shields.io/badge/ASan%2BUBSan-clean-brightgreen)]()
88

99
```
10-
Qwen3.5-35B-A3B MoE, 16GB Mac:
11-
FP32 KV → 최대 32K 컨텍스트
12-
TurboQuant 1b K + Q2 V → 131K 컨텍스트 (18.6x KV 압축)
13-
1410
Gemma 3 4B perplexity (101 토큰, teacher-forced):
1511
FP16 KV: PPL = 35.99
16-
1-bit K + Q4 V: PPL = 36.00 (+0.03%)
12+
1-bit K + Q4 V: PPL = 36.00 (+0.03%) ← 4.9x 압축, 품질 손실 거의 없음
13+
14+
32K 컨텍스트 메모리 (Gemma 3 4B):
15+
FP16 K+V: 4,352 MB
16+
1-bit K + Q4 V: 885 MB (4.9x, 3.4 GB 절약)
17+
1-bit K + Q2 V: 613 MB (7.1x, 3.7 GB 절약)
1718
```
1819

1920
---
@@ -38,12 +39,15 @@ cmake --build build -j$(nproc)
3839

3940
| 모델 | 파라미터 | 포맷 | 속도 | KV 압축 |
4041
|------|----------|------|------|---------|
41-
| **Qwen3.5-35B-A3B** | 35B (3B 활성) | GGUF | 0.5 tok/s | 18.6x (1b K + Q2 V) |
42-
| **Gemma 3 4B** | 4B | TQM | 20.2 tok/s | 4.9x–7.1x |
43-
| **Qwen3.5-0.8B** | 752M | TQM/GGUF | 80.1 tok/s | 4.9x–7.1x |
44-
| **Gemma 3 270M** | 270M | TQM | 176 tok/s | 4.9x–7.1x |
42+
| **Gemma 3 4B** | 4B | TQM | 20.2 tok/s | PPL +0.03%, 모든 KV 타입 ✓ |
43+
| **Qwen3.5-0.8B** | 752M | TQM | 80.1 tok/s | 모든 KV 타입 ✓ |
44+
| **Qwen3.5-0.8B** | 752M | GGUF Q8_0 | 3.7 tok/s | 1b K + Q4 V ✓ |
45+
| **Gemma 3 270M** | 270M | TQM | 176 tok/s | 모든 KV 타입 ✓ |
46+
47+
아키텍처: Gemma 3 (슬라이딩 윈도우, GeGLU), Qwen3.5 (DeltaNet 하이브리드).
4548

46-
아키텍처: Gemma 3 (슬라이딩 윈도우, GeGLU), Qwen3.5 (DeltaNet 하이브리드), Qwen2-MoE (top-K 라우팅, 공유 전문가).
49+
GGUF 지원: Q8_0 검증 완료. K-quant(Q4_K, Q6_K) 및 IQ2 역양자화는 구현되었으나 품질 미검증 — 기여 환영.
50+
MoE 아키텍처 (Qwen3.5-35B-A3B): 로딩과 라우팅 구현 완료, 품질 검증 진행 중.
4751

4852
---
4953

@@ -134,15 +138,15 @@ llama.cpp는 uniform min-max. TurboQuant는 RHT + Lloyd-Max + QJL 잔차 보정
134138
128차원 벡터당 147 ns (NEON 벡터화). 1-bit attention: 1.2 ns/key. matmul (~1ms/레이어) 대비 무시 가능. `bench/bench_kv_overhead.cpp` 참조.
135139

136140
**Q: "소형 모델만 지원?"**
137-
Qwen3.5-35B-A3B MoE가 16GB Mac Air에서 동작 (RSS 4.7GB). GGUF 직접 로딩으로 Q2_K~Q6_K, IQ2 포맷 지원.
141+
GGUF Q8_0은 Qwen3.5 0.8B에서 검증 완료. MoE 아키텍처(35B-A3B)는 로딩과 라우팅이 구현되어 있으며, K-quant/IQ2 역양자화 품질을 안정화 중. 엔진과 KV 압축은 아키텍처 독립적 — 270M~4B에서 검증.
138142

139143
---
140144

141145
## 기술 상세
142146

143147
- **15,000줄+ 순수 C** — 외부 의존성 없음
144-
- **GGUF v3 직접 로딩**llama.cpp 모델을 변환 없이 사용
145-
- **MoE 지원** — top-K expert 라우팅, 공유 전문가, SwiGLU
148+
- **GGUF v3 로딩**Q8_0 검증 완료; K-quant/IQ2 역양자화 구현 (품질 WIP)
149+
- **MoE 라우팅** — top-K expert 선택, 공유 전문가, SwiGLU (품질 WIP)
146150
- **12개 KV 양자화 타입** — Uniform, PolarQuant, QJL, TurboQuant, TurboQuant KV (1/3/4-bit)
147151
- **Fused Q4 attention** — packed nibble에서 직접 가중합
148152
- **적응적 압축** — 레이어별 비트 추천, 코드북 캘리브레이션

README.md

Lines changed: 23 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,20 @@
11
# TurboQuant.cpp
22

3-
**Pure C inference engine with [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV cache compression.**
3+
**Standalone C inference engine with [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV cache compression. Not a wrapper — built from scratch, zero dependencies.**
44

55
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
66
[![Tests](https://img.shields.io/badge/tests-31%20pass-brightgreen)]()
77
[![ASan](https://img.shields.io/badge/ASan%2BUBSan-clean-brightgreen)]()
88

99
```
10-
Qwen3.5-35B-A3B MoE on 16GB Mac:
11-
FP32 KV → max 32K context
12-
TurboQuant 1b K + Q2 V → 131K context (18.6x KV compression)
13-
1410
Gemma 3 4B perplexity (101 tokens, teacher-forced):
1511
FP16 KV: PPL = 35.99
16-
1-bit K + Q4 V: PPL = 36.00 (+0.03%)
12+
1-bit K + Q4 V: PPL = 36.00 (+0.03%) ← 4.9x compression, near-zero quality loss
13+
14+
32K context memory (Gemma 3 4B):
15+
FP16 K+V: 4,352 MB
16+
1-bit K + Q4 V: 885 MB (4.9x, 3.4 GB saved)
17+
1-bit K + Q2 V: 613 MB (7.1x, 3.7 GB saved)
1718
```
1819

1920
---
@@ -25,25 +26,28 @@ git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
2526
cmake -B build -DCMAKE_BUILD_TYPE=Release -DTQ_BUILD_TESTS=ON
2627
cmake --build build -j$(nproc)
2728

28-
# TQM format (pre-converted)
29+
# TQM format (recommended — fully verified)
2930
./build/tq_run model.tqm -p "Hello" -k turbo_kv_1b -v q4
3031

31-
# GGUF format (llama.cpp models directly)
32-
./build/tq_run model.gguf -p "Hello" -k turbo_kv_1b -v q4
32+
# GGUF Q8_0 format (verified)
33+
./build/tq_run model-Q8_0.gguf -p "Hello" -k turbo_kv_1b -v q4
3334
```
3435

3536
---
3637

3738
## Supported Models
3839

39-
| Model | Params | Format | Speed | KV Compression |
40-
|-------|--------|--------|-------|----------------|
41-
| **Qwen3.5-35B-A3B** | 35B (3B active) | GGUF | 0.5 tok/s | 18.6x (1b K + Q2 V) |
42-
| **Gemma 3 4B** | 4B | TQM | 20.2 tok/s | 4.9x–7.1x |
43-
| **Qwen3.5-0.8B** | 752M | TQM/GGUF | 80.1 tok/s | 4.9x–7.1x |
44-
| **Gemma 3 270M** | 270M | TQM | 176 tok/s | 4.9x–7.1x |
40+
| Model | Params | Format | Speed (6T) | KV Verified |
41+
|-------|--------|--------|------------|-------------|
42+
| **Gemma 3 4B** | 4B | TQM | 20.2 tok/s | PPL +0.03%, all KV types ✓ |
43+
| **Qwen3.5-0.8B** | 752M | TQM | 80.1 tok/s | all KV types ✓ |
44+
| **Qwen3.5-0.8B** | 752M | GGUF Q8_0 | 3.7 tok/s | 1b K + Q4 V ✓ |
45+
| **Gemma 3 270M** | 270M | TQM | 176 tok/s | all KV types ✓ |
46+
47+
Architectures: Gemma 3 (sliding window, GeGLU), Qwen3.5 (DeltaNet hybrid).
4548

46-
Architectures: Gemma 3 (sliding window, GeGLU), Qwen3.5 (DeltaNet hybrid), Qwen2-MoE (top-K routing, shared expert).
49+
GGUF support: Q8_0 verified. K-quant (Q4_K, Q6_K) and IQ2 dequantization are implemented but not yet quality-verified — contributions welcome.
50+
MoE architecture (Qwen3.5-35B-A3B): loading and routing implemented, quality verification in progress.
4751

4852
---
4953

@@ -134,15 +138,15 @@ Every NEON path verified against scalar reference (`test_neon_scalar`). A Q4 deq
134138
147 ns per 128-dim vector (NEON-vectorized). 1-bit attention: 1.2 ns/key. Compared to matmul (~1ms/layer), negligible. See `bench/bench_kv_overhead.cpp`.
135139

136140
**Q: "Only small models?"**
137-
Qwen3.5-35B-A3B MoE runs on a 16GB Mac Air (RSS 4.7GB). GGUF direct loading supports Q2_K through Q6_K and IQ2 formats.
141+
GGUF Q8_0 loading is verified for Qwen3.5 0.8B. MoE architecture (35B-A3B) loads and routes correctly; K-quant/IQ2 dequantization quality is being stabilized. The engine and KV compression are architecture-independent — verified on models from 270M to 4B.
138142

139143
---
140144

141145
## Under the Hood
142146

143147
- **15,000+ lines of C** — zero external dependencies
144-
- **GGUF v3 direct loading**use llama.cpp models without conversion
145-
- **MoE support** — top-K expert routing, shared expert, SwiGLU
148+
- **GGUF v3 loading**Q8_0 verified; K-quant/IQ2 dequant implemented (quality WIP)
149+
- **MoE routing** — top-K expert selection, shared expert, SwiGLU (quality WIP)
146150
- **12 KV quantization types** — Uniform, PolarQuant, QJL, TurboQuant, TurboQuant KV (1/3/4-bit)
147151
- **Fused Q4 attention** — weighted sum directly from packed nibbles
148152
- **Adaptive compression** — per-layer bit recommendation, codebook calibration

0 commit comments

Comments
 (0)