https://github.com/youssofal/MTPLX
If you’re running one of the new M5 Macs and you want your local LLMs to feel genuinely fast, MTPLX is worth a look. It’s a free, open-source (Apache-2.0) inference runtime built on Apple’s MLX framework, and it does one thing very well: it speeds up token generation using a technique called native MTP speculative decoding. In plain terms, some models — like Qwen3.6-27B — ship with built-in “multi-token prediction” heads that let the model draft several tokens ahead and then verify them all in a single pass. MTPLX uses those built-in heads as the drafter, so there’s no second model eating your RAM, and crucially it does the math correctly at real sampling temperatures (e.g. temp 0.6 for coding), meaning you get the speedup without degrading output quality.
The headline number is real on the right hardware: roughly 2.24× faster decode, going from about 28 tok/s to 63 tok/s on Qwen3.6-27B — and that benchmark was measured on an M5 Max with fans pinned at full. Independent testers with M5 chips confirmed the gains, with one hitting ~50 tok/s even with a massive 131k-token context, beating llama.cpp’s own MTP implementation. Setup is painless (brew install youssofal/mtplx/mtplx, then mtplx start runs an interactive wizard), and it ships with a proper package: an OpenAI- and Anthropic-compatible API server you can point Claude Code, Cline, or Open WebUI at, plus a browser chat UI, terminal chat, and a tuning command that auto-picks the optimal draft depth for your machine. The default verified model, Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed, is a ~16GB quant that fits comfortably on any reasonably specced M5.
The honest caveat — and the reason this post is aimed at M5 owners specifically — is that the speedup is heavily tied to newer silicon. The whole approach depends on the model’s “verify” step being cheap, which it is on M5 (~47ms) but isn’t on older M1/M2 chips (130–180ms), where community testers repeatedly found MTP actually made things slower at default settings. It also only speeds up generation, not prompt processing, and it currently works only with MLX models that still have their MTP heads intact (mostly the Qwen3.6 family for now). But if you’ve got an M5, those caveats largely don’t apply to you — this is one of the rare “2× faster” claims that holds up, and it’s an easy win for running a capable 27B model at genuinely usable speeds on a laptop. If you have an M5 Mac, try it.
Leave a Comment