How I built a routing proxy that brings Google’s TurboQuant compression to hybrid-attention models running locally on a Mac.
A month ago, Google published a blog post about a paper called TurboQuant. Memory chip stocks lost billions in market cap within hours. The internet called it real-life Pied Piper. Then the sceptics arrived, then the open-source community arrived, and now we’re here — with working code, known limitations, and a much clearer picture of what KV-cache compression actually means for people running LLMs locally.
This post walks through tqstack, a project I built to bring TurboQuant’s ideas to my specific setup: a 64 GB Apple Silicon Mac running Qwen 3.5 and Qwen 3.6 through Ollama, with a single OpenAI-compatible API in front of everything.
The repo is at github.com/eplt/tqstack.
The problem I was trying to solve
I run LLMs locally for coding, document analysis, and general experimentation. My hardware is a Mac with 64 GB of unified memory. My model of choice is qwen3.5:35b-a3b-coding-nvfp4 — a 35-billion-parameter mixture-of-experts model with only 3 billion active parameters per token. It runs well on Ollama 0.19 with the MLX backend enabled, decoding at over 100 tokens per second.
The problem shows up at long context lengths. At 8K tokens, the KV cache is around 80 MB — negligible. But at 128K tokens it grows to over 1.2 GB, and at the model’s native 262K context limit it approaches 2.6 GB. That’s memory that can’t be used for anything else, and on a unified-memory machine it directly competes with the model weights sitting in the same 64 GB pool.
TurboQuant compresses the KV cache through polar-coordinate quantization — rotating vectors into a normalised space and then quantizing with Lloyd-Max codebooks. At 4-bit precision with rotation, this achieves roughly 3.5× compression with less than 1% perplexity loss on most models. That 2.6 GB cache at 262K tokens becomes around 750 MB. On a memory-constrained machine, that’s the difference between a context window that works and one that doesn’t.
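To make the norm-plus-direction idea concrete, here is a toy sketch of it in plain Python. This is not TurboQuant's actual algorithm: per-vector uniform 4-bit quantization stands in for the Lloyd-Max codebooks, and no rotation is applied, so the error is worse than the real scheme's.

```python
import math
import random

def quantize_polar_4bit(v):
    """Toy sketch of the norm-plus-direction idea.

    Stand-ins for illustration only: symmetric uniform 4-bit codes
    replace Lloyd-Max codebooks, and no random rotation is applied.
    """
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    direction = [x / norm for x in v]                # unit vector
    scale = max(abs(x) for x in direction) or 1.0
    codes = [max(-8, min(7, round(x / scale * 7))) for x in direction]
    return norm, scale, codes                        # 4-bit codes + 2 scalars

def dequantize_polar_4bit(norm, scale, codes):
    return [norm * scale * c / 7 for c in codes]

random.seed(0)
v = [random.gauss(0, 1) for _ in range(128)]
n, s, c = quantize_polar_4bit(v)
v_hat = dequantize_polar_4bit(n, s, c)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, v_hat))) / n
```

Even this crude version lands in the low-teens of percent relative error on a random 128-dim vector; the learned codebooks and rotation are what push the real scheme under 1% perplexity loss.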
The catch: Qwen 3.5 isn’t a normal transformer
TurboQuant was designed for models that use standard scaled dot-product attention (SDPA) in every layer. Llama, Mistral, Qwen 2.5, Phi, Gemma — these are all pure-SDPA models where every layer maintains a KV cache that grows with sequence length. TurboQuant compresses all of them.
Qwen 3.5 is different. It uses a hybrid attention architecture that interleaves two types of layers in a 3:1 ratio across its 40-layer stack:
- Gated DeltaNet (GDN) layers — 30 of the 40 layers. These use linear attention with a gated delta update rule. Instead of maintaining a KV cache, each GDN layer compresses the entire input history into a fixed-size recurrent state matrix of shape (batch, 32 heads, 128, 128). This state is roughly 512 KB per layer and does not grow with sequence length.
- Softmax attention layers — 10 of the 40 layers, at indices 3, 7, 11, 15, 19, 23, 27, 31, 35, and 39. These are standard grouped-query attention with 16 Q heads and 2 KV heads. They maintain a conventional KV cache that grows linearly with context.
The pattern repeats 10 times: three GDN layers, then one softmax layer, each followed by a mixture-of-experts feedforward block.
This means that naively applying TurboQuant to all 40 layers would fail. The GDN layers have no KV tensors to quantize — attempting to create a TurboQuant cache for them would either crash or silently waste memory on empty structures. The existing open-source TurboQuant implementations, including the one from sharpner/turboquant-mlx, iterate over every layer and allocate a cache for each. That works for Llama. It doesn’t work for Qwen 3.5.
Qwen 3.6, which shipped on April 16, uses the same hybrid layout — 40 layers, same 3:1 ratio, same head dimensions. Everything in tqstack that works for Qwen 3.5 works for Qwen 3.6 without modification.
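The repeating pattern means the softmax layers sit at every fourth index, which a one-liner reproduces:

```python
# Softmax-attention layer indices in Qwen 3.5/3.6's 40-layer stack:
# every 4th layer, starting from index 3 (the other 30 are GDN).
SOFTMAX_LAYERS = [i for i in range(40) if i % 4 == 3]
print(SOFTMAX_LAYERS)  # [3, 7, 11, 15, 19, 23, 27, 31, 35, 39]
```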
What tqstack does
tqstack is a three-component stack that runs on localhost:
- Ollama (port 11434) — handles most requests. With OLLAMA_MLX=1 set, Ollama 0.19+ uses Apple’s MLX framework for inference, roughly doubling decode speed on supported models. This is the fast path for normal-length conversations.
- TurboQuant sidecar (port 8001) — a FastAPI service that loads an MLX-format model, applies TurboQuant V2 cache compression to only the softmax attention layers, and serves an OpenAI-compatible chat completions endpoint with streaming support.
- Routing proxy (port 8000) — a FastAPI service that presents a single /v1/chat/completions endpoint. It estimates the token count of incoming requests, checks the sidecar’s health (with 30-second caching), and forwards each request to the appropriate backend. Short requests go to Ollama. Long requests go to the sidecar. If the sidecar is down, everything falls back to Ollama silently.
The key technical contribution is in the sidecar’s cache construction. Instead of creating a TurboQuant cache for every layer, tqstack inspects the loaded model at startup, classifies each layer as either GDN or softmax, and builds a hybrid cache list:
cache = []
for i, layer in enumerate(model.layers):
    if is_softmax_layer(layer):
        cache.append(TurboQuantKVCacheV2(
            head_dim=256, bits=4, group_size=64,
            use_rotation=True, use_normalization=True
        ))
    else:
        cache.append(None)  # GDN layer — no KV cache to compress

Detection works two ways. For known architectures like Qwen 3.5 and 3.6, the softmax layers are at predictable indices (every 4th layer starting from index 3). For unknown architectures, tqstack falls back to inspecting each layer’s attributes — looking for k_proj and v_proj (softmax indicators) versus in_proj_qkvz and conv1d (GDN indicators).
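The attribute-based fallback can be sketched as duck-typing. The attribute names below are the ones mentioned above; the function itself is illustrative, not the tqstack source, and the self_attn wrapper name is an assumption about the MLX model definition.

```python
def is_softmax_layer(layer) -> bool:
    """Classify a decoder layer by inspecting its attention module.

    Heuristic: softmax (GQA) attention carries separate k_proj/v_proj
    projections, while Gated DeltaNet layers carry a fused in_proj_qkvz
    projection plus a conv1d. Attribute names are assumptions.
    """
    attn = getattr(layer, "self_attn", layer)
    if hasattr(attn, "k_proj") and hasattr(attn, "v_proj"):
        return True
    if hasattr(attn, "in_proj_qkvz") or hasattr(attn, "conv1d"):
        return False
    raise ValueError(f"cannot classify layer: {type(layer).__name__}")
```

Raising on an unrecognised layer is deliberate: silently guessing would reproduce exactly the crash-or-waste-memory failure mode described above.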
The SDPA dispatch patch from turboquant-mlx is similarly made layer-aware. Only softmax layers get their attention function replaced with the TurboQuant quantized version. GDN layers are left completely untouched.
Why a routing proxy instead of just one backend?
Two reasons: speed and reliability.
For short prompts — the vast majority of interactive use — Ollama with the MLX backend is faster than the sidecar. It runs the model in NVFP4 format, handles the KV cache natively in fp16 or q8_0, and benefits from Ollama’s model caching and keep-alive behaviour. Decode speed on my machine is around 112 tokens per second for Qwen 3.5.
The sidecar is slower per-token because TurboQuant’s dequantization during attention adds overhead. At 8K context, the TurboQuant V2 4-bit path runs at roughly 105% of fp16 speed — nearly free. But at higher compression levels (3-bit with rotation and QJL correction), throughput can drop to 30% of fp16. The sidecar earns its keep at long contexts where the memory savings let you run prompts that would otherwise fail or require aggressive KV eviction.
The routing threshold reflects this tradeoff. For hybrid models like Qwen 3.5, the default threshold is 16,000 tokens — much higher than the 6,000-token default for pure-SDPA models. This is because the hybrid architecture already reduces KV cache pressure by roughly 75% (only 10 of 40 layers have caches), so the sidecar’s compression only becomes worthwhile at genuinely long contexts.
The fallback logic is simple: if the sidecar’s health check fails, the router stops sending requests to it until the next health check succeeds. No requests are dropped. The user never sees an error from a sidecar crash — they just get responses from Ollama instead.
The memory picture on a 64 GB Mac
Here’s what the stack actually looks like in memory:
| Component | Memory |
|---|---|
| Ollama model weights (NVFP4) | ~22 GB |
| GDN recurrent states (30 layers) | ~15 MB |
| GDN conv states | ~0.7 MB |
| Softmax KV cache at 8K (fp16, Ollama) | ~80 MB |
| Softmax KV cache at 128K (TQ 4-bit, sidecar) | ~366 MB |
| macOS + system overhead | ~8 GB |
| Total at 128K context | ~30.5 GB |
That leaves over 30 GB of headroom on a 64 GB machine. Without TurboQuant, the 128K KV cache would be ~1.28 GB — still manageable, but the savings compound at longer contexts and leave more room for the operating system, browser, and other applications competing for the same unified memory pool.
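The KV figures above follow from simple arithmetic. A back-of-the-envelope check, assuming about 10 KB of fp16 K/V per token across the 10 softmax layers and the roughly 3.5x TurboQuant ratio from earlier; the head dimension and element width here are my reading of the table's numbers, not values taken from the model config.

```python
# Assumed shape of the per-token KV footprint (10 softmax layers,
# 2 KV heads, head_dim 128, K and V tensors, 2 bytes per element).
LAYERS, KV_HEADS, HEAD_DIM, KV_TENSORS, BYTES = 10, 2, 128, 2, 2
bytes_per_token = LAYERS * KV_HEADS * HEAD_DIM * KV_TENSORS * BYTES  # 10,240

def kv_cache_mb(tokens: int, compression: float = 1.0) -> float:
    """KV cache size in MiB at a given context length and compression ratio."""
    return tokens * bytes_per_token / compression / 2**20

print(round(kv_cache_mb(8_192)))         # short path, uncompressed
print(round(kv_cache_mb(131_072)))       # 128K, uncompressed
print(round(kv_cache_mb(131_072, 3.5)))  # 128K with TQ 4-bit
```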
On a 32 GB machine, the math is tighter. The model weights alone consume 22 GB, leaving only 10 GB for everything else. tqstack’s installer detects available RAM and disables the sidecar on machines with less than 32 GB, falling back to Ollama-only mode. On 32 GB machines where the sidecar is enabled, it uses lazy loading — the model is loaded only when a long-context request arrives and unloaded after a timeout.
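The lazy-loading behaviour can be sketched as a tiny state machine. This is illustrative only; the real sidecar's loader is more involved, and the loader callback and timeout value are stand-ins.

```python
import time

class LazyModel:
    """Load the model on first use; unload after an idle timeout.

    Illustrative sketch: `loader` stands in for the real MLX model
    load, which takes tens of seconds and ~22 GB of memory.
    """
    def __init__(self, loader, idle_timeout_s: float = 300.0):
        self._loader = loader
        self._timeout = idle_timeout_s
        self._model = None
        self._last_used = 0.0

    def get(self):
        if self._model is None:
            self._model = self._loader()       # expensive: load weights now
        self._last_used = time.monotonic()
        return self._model

    def maybe_unload(self):
        """Called periodically; drops the model after the idle timeout."""
        if self._model is not None and \
                time.monotonic() - self._last_used > self._timeout:
            self._model = None                 # let GC reclaim the weights
```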
What I learned building this
The layer-index bug is real and subtle. The turboquant_plus project documented a bug where boundary-layer selection (keeping the first and last few layers at higher precision for quality) uses raw transformer layer indices instead of KV-layer ordinals. On a hybrid model, raw indices 0, 1, 38, 39 are all GDN layers — they have no KV cache. The “boundary protection” does nothing. tqstack fixes this by building a separate list of softmax-layer indices and using position within that list for boundary decisions.
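The fix boils down to a small mapping from raw layer indices to KV-layer ordinals. A sketch of the corrected boundary logic (the keep_first/keep_last parameter names are illustrative, not tqstack's):

```python
# Raw transformer indices of the layers that actually own a KV cache.
SOFTMAX_LAYERS = [3, 7, 11, 15, 19, 23, 27, 31, 35, 39]

def is_boundary_kv_layer(raw_idx: int,
                         keep_first: int = 2, keep_last: int = 2) -> bool:
    """Boundary protection computed over KV-layer ordinals, not raw indices.

    The buggy version tested raw_idx against [0, 1, 38, 39]; on Qwen 3.5
    those are all GDN layers, so nothing was ever protected.
    """
    if raw_idx not in SOFTMAX_LAYERS:
        return False                            # GDN layer: no KV cache at all
    ordinal = SOFTMAX_LAYERS.index(raw_idx)     # 0..9 among KV-owning layers
    return ordinal < keep_first or ordinal >= len(SOFTMAX_LAYERS) - keep_last
```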
Chat templates matter more than you’d think. The original poorsman project hardcoded chat formatting with generic <|role|>tags. Qwen 3.5 uses specific token IDs (BOS 248044, EOS 248046) and its own role formatting. Using tokenizer.apply_chat_template() instead of string concatenation fixed subtle generation quality issues that were hard to diagnose.
Ollama 0.19 changed the game. The older approach of setting OLLAMA_KV_CACHE_TYPE=q8_0 and OLLAMA_FLASH_ATTENTION=1 as environment variables still works, but Ollama’s native MLX backend (OLLAMA_MLX=1) handles much of this automatically. Some of the original poorsman environment variables may be silently ignored in MLX mode. tqstack’s preflight check detects the Ollama version and adjusts configuration accordingly.
Streaming through a proxy is fiddly. OpenAI’s SSE format requires each chunk to be prefixed with data: and followed by two newlines, with a final data: [DONE] sentinel. Getting this right through two levels of proxying (client → router → sidecar/Ollama) required careful attention to newline handling. Ollama’s streaming format for its native /api/chat endpoint is different from its OpenAI-compatible /v1/chat/completions endpoint, so the router must use the latter exclusively.
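The framing rules are easy to get subtly wrong, so here is a minimal sketch of re-emitting upstream chunks in OpenAI's SSE shape. It is illustrative only: real proxy code must also handle partial reads, backpressure, and client disconnects.

```python
import json

def sse_frame(chunk: dict) -> str:
    """One OpenAI-style SSE event: 'data: <json>' followed by a blank line."""
    return f"data: {json.dumps(chunk)}\n\n"

def sse_done() -> str:
    """Terminal sentinel that OpenAI-compatible clients look for."""
    return "data: [DONE]\n\n"

def relay(upstream_chunks):
    """Re-frame already-parsed upstream chunks for the downstream client."""
    for chunk in upstream_chunks:
        yield sse_frame(chunk)
    yield sse_done()
```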
Where this is heading
TurboQuant will eventually land natively in the tools we already use. There’s an open pull request on mlx-lm adding experimental TurboQuant cache support. Ollama has a feature request for native TurboQuant integration. The llama.cpp discussion has a working CPU implementation ready for review.
When that happens, the sidecar becomes unnecessary — but the router doesn’t. A single endpoint that monitors backend health, routes by context length, and presents a unified OpenAI-compatible API is useful regardless of whether compression happens in a sidecar or natively in Ollama. That’s why tqstack is designed with a configurable TQ_CACHE_VERSION that can be set to native once the underlying tools catch up.
For now, the stack works. I use it daily for coding with Qwen 3.5 through Open WebUI and the OpenAI Python SDK. Long-context tasks like codebase analysis and document summarisation route to the sidecar automatically. Everything else goes through Ollama at full speed. One endpoint, one URL, no manual switching.
Try it
git clone https://github.com/eplt/tqstack.git
cd tqstack
bash install.sh

Requirements: macOS 14+ on Apple Silicon, Ollama 0.19+, Python 3.10+, and at least 32 GB of unified memory.
The repo is at github.com/eplt/tqstack. Issues and contributions welcome.