
turboquant

Here are 133 public repositories matching this topic...

Self-hosted AI agent OS. Your memory, chat, agents, and files stay on hardware you own, offline by default, cloud by choice. Offline AI memory (taOSmd), self-hosted multi-framework group chat, a full web desktop + app store, and auto-clustering across the consumer hardware you already have (Orange/Raspberry Pi, Mac mini, gaming PC).

  • Updated Jun 22, 2026
  • Python

vLLM patcher for Qwen3.6 on consumer NVIDIA — Qwen3.6-35B-A3B-FP8 (192 tok/s, +68% over stock) + Qwen3.6-27B-int4-AutoRound + 256K context. 126 patches: TurboQuant k8v4 KV, MTP/DFlash spec-decode, FULL cudagraph, hybrid GDN streaming, structured boot summary, one-command installer, 1958 tests. v7.72.2.

  • Updated May 12, 2026
  • Python

Native Windows build of vLLM 0.21.0 — no WSL, no Docker. Now for RTX 50-series (Blackwell, sm_120): Python 3.13 + CUDA 12.8 + PyTorch 2.11. Pre-built wheel + Windows patch, 10 KV-cache compression dtypes, and the OpenAI API server fixed to run on Windows.

  • Updated May 26, 2026
  • Python

Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU planner. Compress KV cache 5-80x to run bigger models, longer context, more agents on your GPU.

  • Updated May 11, 2026
  • Python

Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement

  • Updated Mar 28, 2026
  • Python
