README: honest audit — remove unverified 35B claims, clarify GGUF status

unamedkr · claude · unamedkr · commit 99823b4d5da2 · 2026-04-01T22:18:38.000+09:00
Critical corrections:
- Removed 35B 18.6x/131K claim (IQ2_XXS dequant produces garbage)
- GGUF: Q8_0 verified ✓, K-quant/IQ2 marked as WIP
- MoE: loading works, quality verification in progress
- "Standalone C engine, not a wrapper" — made explicit
- 15,000+ lines (was understated at 10,000+)
- Test count: 31 (fixed all inconsistencies)
- Removed "Llama 8B in progress" (no code evidence)
- EN/KO synchronized

What IS verified:
  TQM: Gemma 4B (PPL +0.03%), Qwen 0.8B, Gemma 270M — all KV types
  GGUF Q8_0: Qwen 0.8B — 1-bit K + Q4 V works
  31/31 tests, ASan clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
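
The figures this commit adds to the README (4,352 MB, 4.9x, 7.1x, 3.4 GB, 3.7 GB, +0.03%) can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the repository. All function names are hypothetical. The Gemma 3 4B dimensions (34 layers, 4 KV heads, head dimension 256) are assumptions that do not appear in the commit, as is the reading of "MB" as MiB and the choice to count every layer at the full 32K context (sliding-window layers are ignored).

```c
#include <assert.h>

/* Hypothetical helper: 1 if a is within tol of b. */
static int within(double a, double b, double tol) {
    double d = a - b;
    return (d < 0 ? -d : d) <= tol;
}

/* FP16 K+V cache size in MiB, assuming every layer caches the full context. */
double kv_fp16_mib(int n_layers, int n_kv_heads, int head_dim, long n_ctx) {
    double bytes_per_token = 2.0                /* K and V        */
                           * n_layers * n_kv_heads * head_dim
                           * 2.0;               /* bytes per FP16 */
    return bytes_per_token * (double)n_ctx / (1024.0 * 1024.0);
}

/* Compression ratio of a compressed cache relative to the FP16 baseline. */
double compression_ratio(double baseline_mib, double compressed_mib) {
    return baseline_mib / compressed_mib;
}

/* Memory saved, in GiB. */
double saved_gib(double baseline_mib, double compressed_mib) {
    return (baseline_mib - compressed_mib) / 1024.0;
}

/* Relative perplexity change, in percent. */
double ppl_delta_percent(double ppl_ref, double ppl_test) {
    return (ppl_test - ppl_ref) / ppl_ref * 100.0;
}

/* Checks the README figures against each other, to README rounding. */
void check_readme_figures(void) {
    /* Assumed Gemma 3 4B shape: 34 layers, 4 KV heads, head_dim 256. */
    double fp16 = kv_fp16_mib(34, 4, 256, 32768);
    assert(within(fp16, 4352.0, 0.5));

    assert(within(compression_ratio(4352.0, 885.0), 4.9, 0.05));
    assert(within(compression_ratio(4352.0, 613.0), 7.1, 0.05));

    assert(within(saved_gib(4352.0, 885.0), 3.4, 0.05));
    assert(within(saved_gib(4352.0, 613.0), 3.7, 0.05));

    assert(within(ppl_delta_percent(35.99, 36.00), 0.03, 0.005));
}
```

Calling `check_readme_figures()` from a `main` passes silently when the numbers agree to the precision the README prints. It only confirms that the published figures are internally consistent; it says nothing about how the engine actually lays out its compressed cache.
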
diff --git a/README.ko.md b/README.ko.md
@@ -1,19 +1,20 @@
 # TurboQuant.cpp
 
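-**Pure C inference engine implementing [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV cache compression.**
+**Standalone C inference engine implementing [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV cache compression. Not a wrapper: built from scratch, no external dependencies.**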
 
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
 [![Tests](https://img.shields.io/badge/tests-31%20pass-brightgreen)]()
 [![ASan](https://img.shields.io/badge/ASan%2BUBSan-clean-brightgreen)]()
 
 ```
-Qwen3.5-35B-A3B MoE, 16GB Mac:
-  FP32 KV → max 32K context
-  TurboQuant 1b K + Q2 V → 131K context  (18.6x KV compression)
-
 Gemma 3 4B perplexity (101 tokens, teacher-forced):
   FP16 KV:         PPL = 35.99
-  1-bit K + Q4 V:  PPL = 36.00  (+0.03%)
+  1-bit K + Q4 V:  PPL = 36.00  (+0.03%)   ← 4.9x compression, almost no quality loss
+
+32K context memory (Gemma 3 4B):
+  FP16 K+V:          4,352 MB
+  1-bit K + Q4 V:      885 MB   (4.9x, 3.4 GB saved)
+  1-bit K + Q2 V:      613 MB   (7.1x, 3.7 GB saved)
 ```
 
 ---
@@ -38,12 +39,15 @@ cmake --build build -j$(nproc)
 
 | Model | Params | Format | Speed | KV Compression |
 |------|----------|------|------|---------|
-| **Qwen3.5-35B-A3B** | 35B (3B active) | GGUF | 0.5 tok/s | 18.6x (1b K + Q2 V) |
-| **Gemma 3 4B** | 4B | TQM | 20.2 tok/s | 4.9x–7.1x |
-| **Qwen3.5-0.8B** | 752M | TQM/GGUF | 80.1 tok/s | 4.9x–7.1x |
-| **Gemma 3 270M** | 270M | TQM | 176 tok/s | 4.9x–7.1x |
+| **Gemma 3 4B** | 4B | TQM | 20.2 tok/s | PPL +0.03%, all KV types ✓ |
+| **Qwen3.5-0.8B** | 752M | TQM | 80.1 tok/s | all KV types ✓ |
+| **Qwen3.5-0.8B** | 752M | GGUF Q8_0 | 3.7 tok/s | 1b K + Q4 V ✓ |
+| **Gemma 3 270M** | 270M | TQM | 176 tok/s | all KV types ✓ |
+
+Architectures: Gemma 3 (sliding window, GeGLU), Qwen3.5 (DeltaNet hybrid).
 
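-Architectures: Gemma 3 (sliding window, GeGLU), Qwen3.5 (DeltaNet hybrid), Qwen2-MoE (top-K routing, shared expert).
+GGUF support: Q8_0 verified. K-quant (Q4_K, Q6_K) and IQ2 dequantization are implemented but not yet quality-verified; contributions welcome.
+MoE architecture (Qwen3.5-35B-A3B): loading and routing implemented, quality verification in progress.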
 
 ---
 
@@ -134,15 +138,15 @@ llama.cpp: uniform min-max. TurboQuant: RHT + Lloyd-Max + QJL residual correction
 147 ns per 128-dim vector (NEON-vectorized). 1-bit attention: 1.2 ns/key. Negligible compared to matmul (~1ms/layer). See `bench/bench_kv_overhead.cpp`.
 
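 **Q: "Only small models supported?"**
-Qwen3.5-35B-A3B MoE runs on a 16GB Mac Air (RSS 4.7GB). Direct GGUF loading supports Q2_K through Q6_K and IQ2 formats.
+GGUF Q8_0 is verified on Qwen3.5 0.8B. For the MoE architecture (35B-A3B), loading and routing are implemented, and K-quant/IQ2 dequantization quality is being stabilized. The engine and KV compression are architecture-independent: verified on models from 270M to 4B.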
 
 ---
 
 ## Technical Details
 
 - **15,000+ lines of pure C** — no external dependencies
-- **Direct GGUF v3 loading** — use llama.cpp models without conversion
-- **MoE support** — top-K expert routing, shared expert, SwiGLU
+- **GGUF v3 loading** — Q8_0 verified; K-quant/IQ2 dequantization implemented (quality WIP)
+- **MoE routing** — top-K expert selection, shared expert, SwiGLU (quality WIP)
 - **12 KV quantization types** — Uniform, PolarQuant, QJL, TurboQuant, TurboQuant KV (1/3/4-bit)
 - **Fused Q4 attention** — weighted sum directly from packed nibbles
 - **Adaptive compression** — per-layer bit recommendation, codebook calibration
diff --git a/README.md b/README.md
@@ -1,19 +1,20 @@
 # TurboQuant.cpp
 
-**Pure C inference engine with [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV cache compression.**
+**Standalone C inference engine with [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) KV cache compression. Not a wrapper — built from scratch, zero dependencies.**
 
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
 [![Tests](https://img.shields.io/badge/tests-31%20pass-brightgreen)]()
 [![ASan](https://img.shields.io/badge/ASan%2BUBSan-clean-brightgreen)]()
 
 ```
-Qwen3.5-35B-A3B MoE on 16GB Mac:
-  FP32 KV → max 32K context
-  TurboQuant 1b K + Q2 V → 131K context  (18.6x KV compression)
-
 Gemma 3 4B perplexity (101 tokens, teacher-forced):
   FP16 KV:         PPL = 35.99
-  1-bit K + Q4 V:  PPL = 36.00  (+0.03%)
+  1-bit K + Q4 V:  PPL = 36.00  (+0.03%)   ← 4.9x compression, near-zero quality loss
+
+32K context memory (Gemma 3 4B):
+  FP16 K+V:          4,352 MB
+  1-bit K + Q4 V:      885 MB   (4.9x, 3.4 GB saved)
+  1-bit K + Q2 V:      613 MB   (7.1x, 3.7 GB saved)
 ```
 
 ---
@@ -25,25 +26,28 @@ git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
 cmake -B build -DCMAKE_BUILD_TYPE=Release -DTQ_BUILD_TESTS=ON
 cmake --build build -j$(nproc)
 
-# TQM format (pre-converted)
+# TQM format (recommended — fully verified)
 ./build/tq_run model.tqm -p "Hello" -k turbo_kv_1b -v q4
 
-# GGUF format (llama.cpp models directly)
-./build/tq_run model.gguf -p "Hello" -k turbo_kv_1b -v q4
+# GGUF Q8_0 format (verified)
+./build/tq_run model-Q8_0.gguf -p "Hello" -k turbo_kv_1b -v q4
 ```
 
 ---
 
 ## Supported Models
 
-| Model | Params | Format | Speed | KV Compression |
-|-------|--------|--------|-------|----------------|
-| **Qwen3.5-35B-A3B** | 35B (3B active) | GGUF | 0.5 tok/s | 18.6x (1b K + Q2 V) |
-| **Gemma 3 4B** | 4B | TQM | 20.2 tok/s | 4.9x–7.1x |
-| **Qwen3.5-0.8B** | 752M | TQM/GGUF | 80.1 tok/s | 4.9x–7.1x |
-| **Gemma 3 270M** | 270M | TQM | 176 tok/s | 4.9x–7.1x |
+| Model | Params | Format | Speed (6T) | KV Verified |
+|-------|--------|--------|------------|-------------|
+| **Gemma 3 4B** | 4B | TQM | 20.2 tok/s | PPL +0.03%, all KV types ✓ |
+| **Qwen3.5-0.8B** | 752M | TQM | 80.1 tok/s | all KV types ✓ |
+| **Qwen3.5-0.8B** | 752M | GGUF Q8_0 | 3.7 tok/s | 1b K + Q4 V ✓ |
+| **Gemma 3 270M** | 270M | TQM | 176 tok/s | all KV types ✓ |
+
+Architectures: Gemma 3 (sliding window, GeGLU), Qwen3.5 (DeltaNet hybrid).
 
-Architectures: Gemma 3 (sliding window, GeGLU), Qwen3.5 (DeltaNet hybrid), Qwen2-MoE (top-K routing, shared expert).
+GGUF support: Q8_0 verified. K-quant (Q4_K, Q6_K) and IQ2 dequantization are implemented but not yet quality-verified — contributions welcome.
+MoE architecture (Qwen3.5-35B-A3B): loading and routing implemented, quality verification in progress.
 
 ---
 
@@ -134,15 +138,15 @@ Every NEON path verified against scalar reference (`test_neon_scalar`). A Q4 deq
 147 ns per 128-dim vector (NEON-vectorized). 1-bit attention: 1.2 ns/key. Compared to matmul (~1ms/layer), negligible. See `bench/bench_kv_overhead.cpp`.
 
 **Q: "Only small models?"**
-Qwen3.5-35B-A3B MoE runs on a 16GB Mac Air (RSS 4.7GB). GGUF direct loading supports Q2_K through Q6_K and IQ2 formats.
+GGUF Q8_0 loading is verified for Qwen3.5 0.8B. MoE architecture (35B-A3B) loads and routes correctly; K-quant/IQ2 dequantization quality is being stabilized. The engine and KV compression are architecture-independent — verified on models from 270M to 4B.
 
 ---
 
 ## Under the Hood
 
 - **15,000+ lines of C** — zero external dependencies
-- **GGUF v3 direct loading** — use llama.cpp models without conversion
-- **MoE support** — top-K expert routing, shared expert, SwiGLU
+- **GGUF v3 loading** — Q8_0 verified; K-quant/IQ2 dequant implemented (quality WIP)
+- **MoE routing** — top-K expert selection, shared expert, SwiGLU (quality WIP)
 - **12 KV quantization types** — Uniform, PolarQuant, QJL, TurboQuant, TurboQuant KV (1/3/4-bit)
 - **Fused Q4 attention** — weighted sum directly from packed nibbles
 - **Adaptive compression** — per-layer bit recommendation, codebook calibration