turboquant

Here are 133 public repositories matching this topic...

RyanCodrai / turbovec

A vector index built on TurboQuant, written in Rust with Python bindings

python rust neon embeddings simd nearest-neighbor quant ann quantization avx512 embedding faiss rag vector-search turboquant

Updated Jun 10, 2026
Python

Anbeeld / beellama.cpp

DFlash & TurboQuant in llama.cpp with up to 3x faster generation and 7.5x more KV cache in same VRAM

inference quantization kv-cache llm llm-serving llama-cpp ggml llm-inference speculative-decoding dflash turboquant

Updated Jun 17, 2026
C++

quantumaikr / quant.cpp

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

embeddable transformer pure-c quantization delta-compression kv-cache llm llm-inference gguf turboquant

Updated Apr 26, 2026
C

AtomicBot-ai / atomic-llama-cpp-turboquant

llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% throughput).

Updated Jun 19, 2026
C++

jaylfc / taOS

Self-hosted AI agent OS. Your memory, chat, agents, and files stay on hardware you own, offline by default, cloud by choice. Offline AI memory (taOSmd), self-hosted multi-framework group chat, a full web desktop + app store, and auto-clustering across the consumer hardware you already have (Orange/Raspberry Pi, Mac mini, gaming PC).

raspberry-pi privacy offline-first distributed-computing self-hosted orange-pi ai-agents data-sovereignty ai-platform local-first agent-framework apple-silicon llm vllm local-llm llm-inference kv-cache-quantization rockchip-npu turboquant

Updated Jun 22, 2026
Python

PacifAIst / Quansloth

Based on the implementation of Google's TurboQuant (ICLR 2026) — Quansloth brings elite KV cache compression to local LLM inference. Quansloth is a fully private, air-gapped AI server that runs massive context models natively on consumer hardware with ease

cuda turboquant quansloth vram-wall

Updated May 13, 2026
Python

Sandermage / genesis-vllm-patches

vLLM patcher for Qwen3.6 on consumer NVIDIA — Qwen3.6-35B-A3B-FP8 (192 tok/s, +68% over stock) + Qwen3.6-27B-int4-AutoRound + 256K context. 126 patches: TurboQuant k8v4 KV, MTP/DFlash spec-decode, FULL cudagraph, hybrid GDN streaming, structured boot summary, one-command installer, 1958 tests. v7.72.2.

Updated May 12, 2026
Python

arozanov / turboquant-mlx

TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.

metal quantization mlx kv-cache apple-silicon llm turboquant

Updated Apr 30, 2026
Python

Indras-Mirror / llama.cpp-turboq-mtp

Fused TBQ4 Flash Attention + MTP + Shared Tensors for llama.cpp — 82+ tok/s with lossless 4.25 bpv KV cache at 200K context on RTX 4090

cuda quantization mtp kv-cache fwht llama-cpp flash-attention qwen speculative-decoding rtx-4090 multi-token-prediction turboquant tbq4 tensor-sharing

Updated May 17, 2026
C++

manjunathshiva / turboquant-mlx

Extreme weight + KV cache compression for LLMs on Apple Silicon (MLX implementation of Google's TurboQuant)

quantization mlx kv-cache apple-silicon llm turboquant

Updated Jun 20, 2026
Python

Alberto-Codes / turboquant-vllm

TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs

compression transformer triton quantization inference-optimization kv-cache llm vllm consumer-gpu turboquant

Updated Apr 10, 2026
Python

bigmacfive / turbo-graph

TurboQuant-compatible vector search plus graph memory for constrained RAG.

python rust graph embeddings rag vector-search turboquant turbovec

Updated Jun 19, 2026
Rust

back2matching / turboquant

First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.

machine-learning compression gpu transformers inference pytorch quantization vram huggingface kv-cache llm turboquant

Updated Apr 21, 2026
Python

atomicmilkshake / llama-cpp-turboquant

llama.cpp fork with TurboQuant quantization (turbo2/3/4) and TriAttention GPU-accelerated KV cache pruning. 75 tok/s on Qwen3-8B / RTX 3080.

windows cuda inference quantization kv-cache llm llama-cpp ggml turboquant triattention

Updated Apr 9, 2026
C++

aivrar / vllm-windows-build

Native Windows build of vLLM 0.21.0 — no WSL, no Docker. Now for RTX 50-series (Blackwell, sm_120): Python 3.13 + CUDA 12.8 + PyTorch 2.11. Pre-built wheel + Windows patch, 10 KV-cache compression dtypes, and the OpenAI API server fixed to run on Windows.

Updated May 26, 2026
Python

croll83 / llama.cpp-dgx

llama.cpp fork optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1) — TurboQuant weights + KV, NVFP4, DFlash MTP

blackwell llama-cpp speculative-decoding gb10 nvfp4 dflash turboquant

Updated May 26, 2026
C++

aivrar / multi-turboquant

Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU planner. Compress KV cache 5-80x to run bigger models, longer context, more agents on your GPU.

Updated May 11, 2026
Python

danilodevhub / turboquant-js

TypeScript implementation of Google's TurboQuant algorithm for near-optimal vector quantization. Zero dependencies, works in Node.js and browsers.

machine-learning typescript browser compression embeddings nearest-neighbor quantization vector-quantization vector-search kv-cache llm transformers-js turboquant

Updated Jun 20, 2026
TypeScript

Firmamento-Technologies / TurboQuant

Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement

Updated Mar 28, 2026
Python

artalis-io / bitnet.c

Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.

c neon wasm inference simd moe avx2 quantization kv-cache cpu-inference llm gguf turboquant

Updated Jun 12, 2026
C
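
Several descriptions above mention WHT- or FWHT-rotated KV caches with low-bit storage (for example atomic-llama-cpp-turboquant and llama.cpp-turboq-mtp). The sketch below is only a generic illustration of that idea: a random-sign Walsh-Hadamard rotation followed by plain uniform scalar quantization. It is not the code of any listed repository and not the TurboQuant algorithm itself; every function name, the per-vector min/max scaling, and the 4-bit default are assumptions chosen for illustration.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def make_signs(n, seed=0):
    """Hypothetical helper: a fixed random +/-1 pattern shared by encoder and decoder."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(vec, signs):
    """Flip signs, then apply the transform; this spreads outliers across coordinates."""
    return fwht([v * s for v, s in zip(vec, signs)])


def unrotate(rot, signs):
    """Inverse of rotate: the orthonormal transform is its own inverse."""
    return [v * s for v, s in zip(fwht(rot), signs)]


def quantize(rot, bits=4):
    """Uniform scalar quantization to 2**bits levels over the vector's own range."""
    levels = (1 << bits) - 1
    lo, hi = min(rot), max(rot)
    step = (hi - lo) / levels or 1.0  # avoid a zero step for constant vectors
    codes = [round((v - lo) / step) for v in rot]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Map integer codes back to approximate rotated values."""
    return [lo + c * step for c in codes]


if __name__ == "__main__":
    signs = make_signs(8, seed=1)
    vec = [0.1, -0.2, 0.05, 9.0, 0.0, 0.3, -0.1, 0.2]  # one large outlier
    codes, lo, step = quantize(rotate(vec, signs), bits=4)
    approx = unrotate(dequantize(codes, lo, step), signs)
    print(codes)
    print([round(v, 3) for v in approx])
```

Because the rotation is orthonormal, the reconstruction error in the original space equals the quantization error in the rotated space, which is why such schemes quantize after rotating rather than before.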
