
I Tested Google's TurboQuant — Here's What I Found

Google Research's TurboQuant delivers up to 5× more context on the same hardware. I pulled the prototype and ran the numbers. Here's what I found.


Google Research quietly dropped a paper at ICLR 2026 that should be getting a lot more attention: TurboQuant, a 3-bit KV cache compression technique that delivers up to 5× more context on the same hardware, with near-zero accuracy loss and no retraining required.

I pulled the prototype, verified the math, and ran initial tests. Here's what I found.

What TurboQuant actually does

Every time you run inference on a large language model, the KV cache eats memory. The bigger your context window, the more memory it consumes. This is the wall every local model user hits — you run out of VRAM before you run out of things to say.

TurboQuant compresses the KV cache from 16-bit down to 3-bit using a two-stage approach:

  1. PolarQuant converts cache vectors into polar coordinates, separating magnitude from direction
  2. QJL (Quantized Johnson-Lindenstrauss) applies 1-bit sign quantization to the direction component
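
The two stages can be sketched in a few lines of NumPy. This is my own illustration of the idea, not the paper's code: I store one float for the magnitude, and approximate the direction with a random-projection sign sketch (the QJL-style step). The dimensions `d` and `m` are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 256                      # vector dim, sketch dim (assumed)
S = rng.standard_normal((m, d))      # shared JL projection matrix

def encode(v):
    mag = np.linalg.norm(v)          # stage 1: magnitude, kept as one scalar
    signs = np.sign(S @ (v / mag))   # stage 2: direction as m one-bit signs
    return mag, signs

def decode(mag, signs):
    # Back-project the signs and renormalize to recover the direction,
    # then rescale by the stored magnitude.
    u = S.T @ signs
    return mag * u / np.linalg.norm(u)

v = rng.standard_normal(d)
mag, bits = encode(v)
v_hat = decode(mag, bits)
cos = v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))
```

With a sketch this small the reconstructed direction is only roughly aligned with the original; the paper's angle-bit budget and unbiasing details are what close the gap to near-lossless.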

The result: up to 6× memory reduction and an 8× speedup on H100s.

The numbers

I ran the compression benchmarks against the paper's claims:

  • 3-bit MSE: 0.034 (matches paper exactly)
  • 4-bit MSE: 0.009 (matches paper)
  • Compression ratio: 4.9× at 3-bit, 7.1× at 2-bit
  • Throughput: 40,579 vectors/sec on CPU
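
As a sanity check, the compression ratios follow from simple bit arithmetic. This is my own back-of-envelope accounting, not the paper's: I assume one fp16 scale/magnitude stored per 128-element vector of low-bit codes, which is why the ratio lands below the ideal 16/3 ≈ 5.3×.

```python
def ratio(orig_bits, code_bits, dim=128, overhead_bits=16):
    """Compression ratio for one vector: full-precision size over
    (low-bit codes + per-vector fp16 overhead)."""
    original = orig_bits * dim
    compressed = code_bits * dim + overhead_bits
    return original / compressed

r3 = ratio(16, 3)   # roughly 5×, in line with the measured 4.9×
r2 = ratio(16, 2)   # roughly 7.5×, in line with the measured 7.1×
```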

Initial tests with a patched llama.cpp build showed Qwen 3.5-9B handling a 20,000-token context on a MacBook Air M4 (16GB), previously impossible on that device. Across early testing, results were lossless on four of seven models and near-lossless on six of seven at 3.28 to 3.67 angle bits per element.

Not perfectly lossless across every model — but close enough that it matters.

What this means for real hardware

Mac Studio M3 Ultra (192GB) running Llama 70B:

  • Before: ~492K tokens
  • After: ~2.4M tokens

A100 80GB:

  • Before: ~131K tokens
  • After: ~640K tokens

3× RTX 3090 (72GB):

  • Before: ~111K tokens
  • After: ~509K tokens
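
The before/after figures fall out of per-token KV cache size. Here's my rough reconstruction of the Mac Studio row, assuming Llama-70B's GQA dimensions (80 layers, 8 KV heads, head dim 128), fp16 cache before and ~3.3 bits per element after; the ~160GB of free memory after weights is my guess, not a measurement.

```python
def max_tokens(free_bytes, bits_per_elem, layers=80, kv_heads=8, head_dim=128):
    """Token capacity given free memory and KV cache precision."""
    elems_per_token = 2 * layers * kv_heads * head_dim  # K and V per token
    bytes_per_token = elems_per_token * bits_per_elem / 8
    return int(free_bytes / bytes_per_token)

free = 160e9                     # ~160 GB left for cache (assumed)
before = max_tokens(free, 16)    # fp16 cache: ~490K tokens
after = max_tokens(free, 3.3)    # ~3.3 bits/elem: ~2.4M tokens
```

The same formula with each GPU's free memory reproduces the A100 and 3090 rows to within rounding.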

No new hardware. No retraining. Smarter compression.

Why this matters now

Cloud inference is expensive. Local inference is limited. That's been the trade-off for anyone running AI agents in production.

TurboQuant doesn't eliminate the trade-off — but it shifts the math significantly. An Ollama PR is pending merge. Once it lands, every local model user gets this for free.

More context means fewer hallucinations, better reasoning, and longer coherent sessions. For anyone running autonomous agents overnight, this changes what's possible.

The future of AI inference isn't bigger GPUs. It's smarter compression.


Paper: arxiv.org/abs/2504.19874 — Google Research, ICLR 2026