I ❤️ A _ _ _ I 🍻

I ported a research KV-cache quantizer, KVarN, to Apple Silicon from my iPhone in Japan, using a $30 Chinese LLM

I just shipped mlx-kvarn: the first MLX-native implementation of KVarN, a KV-cache quantization method from a 2026 Huawei paper. It gives you up to ~4.7× more KV-cache capacity on a Mac, at near-FP16 speed and accuracy, as a one-line drop-in for mlx-lm.

That’s the what. The why is more fun, and honestly the part I want to write about: I built almost all of it with an AI coding agent driven by a cheap Chinese open model, mostly tapping away in tmux on my phone while travelling around Japan. It cost me less than the price of a couple of coffees. I wanted to see whether that was actually possible — and what the experience taught me about where the cheap models are, and aren’t, good enough.

The itch

KVarN is a neat piece of work. The KV cache — the running memory of attention keys and values — is what eats your RAM as context grows, and on a Mac that RAM is unified and finite. Quantize the cache and you fit longer contexts or bigger models in the same machine. The KVarN paper’s trick is to make low-bit quantization safe: a Hadamard rotation to spread out outliers, then Sinkhorn variance normalization to balance each tile before rounding. The result holds up on reasoning benchmarks where naive 2-bit quantization falls apart.
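Here is a minimal sketch of those two steps in plain Python, so it runs anywhere. It illustrates the general idea only: it is not the paper’s exact algorithm and not the mlx-kvarn kernels. The function names, the tile layout (one row per vector), the iteration count, and the simple symmetric round-to-nearest quantizer are all assumptions made for the example.

```python
import math


def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    Every output mixes every input, so one large outlier gets spread
    across the whole vector. Applying the transform twice undoes it.
    """
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def sinkhorn_normalize(tile, iters=8, eps=1e-8):
    """Alternately rescale rows and columns toward unit RMS (Sinkhorn-style).

    Returns (normalized, row_scale, col_scale) such that
    tile[r][c] == normalized[r][c] * row_scale[r] * col_scale[c].
    """
    rows, cols = len(tile), len(tile[0])
    t = [list(row) for row in tile]
    row_scale = [1.0] * rows
    col_scale = [1.0] * cols
    for _ in range(iters):
        for r in range(rows):
            rms = math.sqrt(sum(x * x for x in t[r]) / cols) + eps
            row_scale[r] *= rms
            t[r] = [x / rms for x in t[r]]
        for c in range(cols):
            rms = math.sqrt(sum(t[r][c] ** 2 for r in range(rows)) / rows) + eps
            col_scale[c] *= rms
            for r in range(rows):
                t[r][c] /= rms
    return t, row_scale, col_scale


def quantize(tile, bits=4):
    """Symmetric round-to-nearest onto integer codes in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for row in tile for x in row) or 1.0
    step = amax / qmax
    codes = [[max(-qmax, min(qmax, round(x / step))) for x in row] for row in tile]
    return codes, step


def compress(tile, bits=4):
    """Rotate each row, balance the tile, then round."""
    rotated = [hadamard_rotate(row) for row in tile]
    normalized, row_scale, col_scale = sinkhorn_normalize(rotated)
    codes, step = quantize(normalized, bits)
    return codes, step, row_scale, col_scale


def decompress(codes, step, row_scale, col_scale):
    """Undo the rounding scale, the Sinkhorn scales, then the rotation."""
    rescaled = [
        [q * step * row_scale[r] * col_scale[c] for c, q in enumerate(row)]
        for r, row in enumerate(codes)
    ]
    return [hadamard_rotate(row) for row in rescaled]
```

The design point to notice is that only the small integer codes plus one scale per row and per column need to be stored; the rotation itself costs no storage because it is a fixed transform.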

The catch: the official implementation is CUDA-only, built into a vLLM fork. Nothing for MLX, Apple’s framework. I run models locally on Apple Silicon, there was an obvious gap, and the method was well-documented. Classic “someone should do this” — so I decided to find out if I could, with an agent doing the actual typing.

The setup: a phone, a Mac, and a budget model

The constraints were self-imposed and a little ridiculous, which was the point. My Mac stayed at home running the work; I was on holiday in Tokyo and Osaka with an iPhone. So the toolchain was mosh + tmux over whatever connection I had — shinkansen, hotel wifi, a conbini parking lot — with the AI coding agent running on the Mac and me steering from a phone keyboard. Mosh’s tolerance for flaky links and roaming IPs turned out to be the unsung hero; a plain SSH session would have died a hundred times.

The model doing the coding was Qwen 3.6/3.7 Plus — a cheap, recent Chinese model. I reserved a frontier US model (Opus 4.8) for a narrow role: planning the test specs and reviewing results, not writing code. I wanted a clean read on what the cheap model could carry on its own.

(I wrote about using tmux and mosh in another article recently.)

What actually happened

The honest version is: it worked, and it took way longer than I planned. Roughly a day of cumulative agent time spread across five days, and the project went through something like eight rounds of “looks done” → “actually, no.”

The early rounds were rough in instructive ways. The first working version ran at 1 token per second — 55× slower than FP16 — because the dequantization was a Python loop firing hundreds of tiny GPU kernels per step. The fix (batching it into a single Metal kernel) got it to ~120 tok/s, but diagnosing it required correctly identifying that the bottleneck was dispatch overhead, not compute — a subtle call that an earlier report had gotten flat wrong, blaming a “mysterious framework tax.”
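A toy cost model makes the dispatch-versus-compute distinction easier to see. The function names and every number below are hypothetical, picked only to show the shape of the effect; none of them are measurements from this project.

```python
def step_time_ms(num_dispatches, overhead_us, compute_ms):
    """Time for one decode step: a fixed cost per kernel launch plus the real GPU work."""
    return num_dispatches * overhead_us / 1000.0 + compute_ms


def overhead_share(num_dispatches, overhead_us, compute_ms):
    """Fraction of the step spent launching kernels rather than computing."""
    total = step_time_ms(num_dispatches, overhead_us, compute_ms)
    return (total - compute_ms) / total if total > 0 else 0.0


# Hypothetical inputs: same total compute, very different launch counts.
looped = overhead_share(num_dispatches=500, overhead_us=20.0, compute_ms=1.0)
batched = overhead_share(num_dispatches=1, overhead_us=20.0, compute_ms=1.0)
print(f"looped: {looped:.0%} overhead, batched: {batched:.0%} overhead")
```

When the launch count dominates like this, making the arithmetic faster changes almost nothing; reducing the number of launches is the fix.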

Then the long tail of bugs that only careful testing catches. A quantization preset that silently ignored its own settings because a Metal kernel had a bit-width hardcoded. A “validation” that turned out to be testing random noise in the wrong coordinate frame, making a real result look broken. An accuracy claim that contradicted its own data table. Each one got caught, root-caused, and fixed — and crucially, documented as wrong rather than papered over.

That last part is where the workflow mattered. I structured it as cycles: write a technical spec for the next round of work or testing, have the cheap agent execute it and report back, then use the frontier model to scrutinize the results and write the next spec. The reviewing model repeatedly caught the cheap model overclaiming — “this passed!” when the test had actually measured the wrong thing, or “no accuracy loss” sitting next to a table showing divergence. The division of labor — cheap model for volume, expensive model for judgment — turned out to be the real lesson.

Where it landed

The final v1.0 is something I’m comfortable putting out on my GitHub, precisely because the claims are scoped to what the tests actually showed:

  • On GSM8K (200 problems, Qwen2.5-3B), KVarN matches FP16 within statistical noise, and beats mlx-lm’s own built-in quantized cache by a few points at higher compression.
  • Greedy output is token-identical to FP16 most of the time; where it diverges, it’s a single argmax tie-break, not compounding error. (The evidence: the divergence point is fixed regardless of how long you generate — error accumulation would creep earlier, and it doesn’t.)
  • It’s a one-line patch on top of mlx-lm, verified across five model families.
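The divergence check in the second bullet can be sketched as follows. This is an illustration of the idea, not the project’s test code: the helper names and the `runs` layout (a dict from generation length to a pair of token lists) are assumptions.

```python
def first_divergence(reference, candidate):
    """Index of the first differing token, or None if the shared prefix matches."""
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    return None


def divergence_points(runs):
    """Map each generation length to its first divergence index.

    `runs` is {max_tokens: (fp16_tokens, quantized_tokens)}.
    """
    return {n: first_divergence(ref, cand) for n, (ref, cand) in runs.items()}


def is_fixed(points):
    """True if every run that diverged did so at the same index."""
    seen = {p for p in points.values() if p is not None}
    return len(seen) <= 1
```

If the index moved earlier as generations got longer, that would point toward accumulating error rather than a one-off tie-break.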

It also has honest limitations: slower prefill, a couple of optimizations that turned out to be blocked by math (RoPE doesn’t commute with the rotation, which kills the obvious speed trick), and a fused kernel that’s disabled pending a real bug fix. Those are in the README too, because a port nobody can trust isn’t worth shipping. (I’m not claiming it is completely bug-free, but it should be helpful to anyone looking at implementing KVarN on MLX. I only tested it on my first-gen Apple Silicon hardware, a Mac Studio M1 Ultra with 64GB of RAM.)
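The RoPE point is easy to check numerically in the smallest possible case. RoPE rotates each pair of dimensions by a position-dependent angle, so a 2×2 rotation block stands in for it here, and a 2×2 normalized Hadamard matrix stands in for the cache rotation. The real rotation acts on the whole head dimension; this toy version and its helper names are assumptions for illustration.

```python
import math


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def rope_block(theta):
    """The 2x2 rotation RoPE applies to one pair of dimensions."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


INV_SQRT2 = 1.0 / math.sqrt(2.0)
H2 = [[INV_SQRT2, INV_SQRT2], [INV_SQRT2, -INV_SQRT2]]  # normalized 2x2 Hadamard


def commutes(theta, tol=1e-12):
    """True if rotating then applying H2 equals applying H2 then rotating."""
    rh = matmul2(rope_block(theta), H2)
    hr = matmul2(H2, rope_block(theta))
    return all(abs(rh[i][j] - hr[i][j]) < tol for i in range(2) for j in range(2))


print(commutes(0.0), commutes(0.3))
```

The two orders agree only when the angle’s sine is zero, which for RoPE means essentially never, so the order in which the two operations are applied cannot be swapped for free.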

So, can a $30 Chinese model build this?

Yes — with caveats I think are worth being precise about.

It is genuinely remarkable that a budget open model can carry this much real engineering: Metal kernels, quantization math, debugging dispatch bottlenecks, multi-round refactors. Two years ago this was unthinkable at any price. The cost asymmetry is stark — a frontier model on the same work would have run into the thousands of dollars in API usage; this used a sliver of a $30 monthly plan.

But “it can” isn’t “it’s free.” The cheaper model needed far more supervision. It overclaimed, it fixed bugs while quietly breaking its own earlier conclusions, it occasionally fabricated things (at one low point it invented author names in the citation — exactly the kind of error you cannot ship in a project whose whole point is faithfully crediting someone else’s research). Every one of those was caught only because there was a review loop and a hard rule: no claim ships that exceeds the data. A frontier model would very likely have reached a cleaner result with less hand-holding, and faster.

So the real takeaway isn’t a leaderboard verdict. It’s that the shape of the work changed. The expensive, scarce resource is no longer “can it write the code” — the cheap model can. It’s judgment: knowing when a green checkmark is lying, when a benchmark is measuring the wrong thing, when a claim has drifted from its evidence. For now, that judgment is still worth paying frontier prices for, and it pairs beautifully with cheap volume underneath. Give it six months and even that gap will likely narrow.

I did this because I could, and because finding out exactly how I could was the interesting part. The artifact is a real, useful tool. But the thing I’ll remember is debugging a Metal kernel from a phone on a train, with a model that cost almost nothing, and watching it — slowly, messily, with a lot of supervision — actually get there.

The code is on GitHub. If you run LLMs on a Mac and want longer context, give it a try — and if you’re one of the KVarN authors, thank you for the method; I hope the MLX port does it justice.