Flash-MoE hit the Hacker News front page this week, and the premise is hard to scroll past: a 397-billion parameter model running on a MacBook Pro at around 4.4 tokens per second, with no cloud, no GPU cluster.
The underlying model is Qwen3.5-397B-A17B, a Mixture-of-Experts architecture from Alibaba released in February. The MoE part is what makes this possible. A dense 397B model would need something like 800GB of memory to hold in full at 16-bit precision — nowhere near consumer hardware. But MoE models don’t work that way: they contain many expert modules, and for any given token only a small subset of those experts activates. In Qwen3.5’s case there are 512 experts per layer, and the router activates 10 of them plus one shared expert per token by default. Flash-MoE prunes this further to 4 active experts per token as a speed tradeoff. The rest of the weights sit idle.
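The routing step is simple to sketch. The following is an illustrative top-k selection in C — not Flash-MoE's actual kernel, and the function names are hypothetical — but the selection logic is what any MoE router boils down to: score all experts, keep the k best.

```c
#define N_EXPERTS 512  /* experts per layer in Qwen3.5, per the article */
#define TOP_K 4        /* Flash-MoE's pruned active-expert count */

/* Select the indices of the k largest router logits. This is a plain
 * O(n*k) scan for clarity; real engines fuse this into a GPU kernel,
 * but the output -- which experts run for this token -- is the same. */
static void top_k_experts(const float *logits, int n, int k, int *out_idx) {
    for (int slot = 0; slot < k; slot++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            int taken = 0;
            for (int s = 0; s < slot; s++)
                if (out_idx[s] == i) { taken = 1; break; }
            if (taken) continue;
            if (best < 0 || logits[i] > logits[best]) best = i;
        }
        out_idx[slot] = best;
    }
}
```

The point for memory purposes: after this step, only those 4 experts' weights need to be resident — the other 508 per layer can stay on disk.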
Flash-MoE exploits this directly. Rather than loading the full 209GB model into RAM, it streams experts from the SSD on demand using parallel reads. The M3 Max’s NVMe reads at around 17.5 GB/s, fast enough to keep expert loading from being the dominant bottleneck. The page cache handles repeated expert access naturally, achieving ~71% hit rates without custom caching logic.
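The "page cache handles it for free" part is worth making concrete. Here is a minimal sketch of SSD-backed expert access via mmap — the file layout (fixed-size experts packed back to back) and all names are hypothetical simplifications, and Flash-MoE's real pack format and parallel-read path are surely more involved — but it shows the mechanism: map the file once, and the kernel faults pages in from NVMe on first touch and caches them thereafter.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    const unsigned char *base;  /* start of the mapped pack file */
    size_t expert_bytes;        /* assumed fixed size per expert */
    size_t n_experts;
} expert_file;

/* Map the packed expert file read-only. No explicit caching layer:
 * repeated access to a hot expert hits the OS page cache (the ~71%
 * hit rate the article cites comes from exactly this behavior). */
static int expert_file_open(expert_file *ef, const char *path, size_t expert_bytes) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    if (p == MAP_FAILED) return -1;
    ef->base = p;
    ef->expert_bytes = expert_bytes;
    ef->n_experts = (size_t)st.st_size / expert_bytes;
    return 0;
}

/* Returning a pointer *is* "loading" the expert: cold pages stream in
 * from the SSD on demand, hot ones are already resident. */
static const unsigned char *expert_weights(const expert_file *ef, size_t idx) {
    return ef->base + idx * ef->expert_bytes;
}
```

The design choice this illustrates: by leaning on mmap and the page cache, the working set adapts to access patterns automatically, with no eviction policy to write or tune.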
This is the practical implementation of an idea Apple’s research team published in 2023 — “LLM in a Flash” — which laid out the theory for using SSD bandwidth to run models larger than available RAM. That paper was influential, but the author’s claim is that no usable open-source implementation had materialized until now — not just “it runs” but “it runs fast enough to have a conversation.”
The inference engine is C and Objective-C with hand-tuned Metal shaders — no PyTorch, no llama.cpp, no framework dependency chain. (The repo does include Python scripts for model packing and experiments, but the inference path itself is pure native code.) The author describes building it in 24 hours using Claude Code to run roughly 90 experiments — leaning on automated search to find the right kernel optimizations rather than tuning by hand. That part of the story should probably be taken with some skepticism (codebases that look clean in retrospect rarely emerge cleanly), but the result is real: the repo is auditable, and the benchmarks have been reproduced.
The hardware requirements matter for context. You need an M3 Max (or equivalent) with 48GB unified memory and a 1TB NVMe. That’s still a $3,000–4,000 machine. This isn’t running on anyone’s base-model laptop. But it’s also a machine that a lot of ML researchers and engineers already own, and the gap between “requires a GPU workstation” and “runs on the MacBook on your desk” is a meaningful one.
The broader implication is about what MoE architecture makes possible at the edge. Dense models scale poorly for local inference because you can’t avoid loading the full parameter count. MoE models can scale to very large total parameter counts while keeping the active compute per token manageable — which means “total size” and “inference cost” can diverge in ways that enable approaches like this. That property is going to get more interesting as MoE models become more capable and SSDs get faster.
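The divergence is easy to quantify with back-of-envelope arithmetic. The 397B/17B split comes straight from the model name (Qwen3.5-397B-A17B); the 4-bit quantization figure is our assumption, chosen because it roughly matches the 209GB on-disk size the article mentions (the remainder being metadata and higher-precision shared layers).

```c
#define TOTAL_PARAMS    397e9  /* total parameters, from the model name */
#define ACTIVE_PARAMS   17e9   /* active per token: the "A17B" suffix */
#define BYTES_PER_PARAM 0.5    /* 4-bit quantization -- our assumption */

/* Weight footprint in GB for a given parameter count. */
static double weights_gb(double params) {
    return params * BYTES_PER_PARAM / 1e9;
}
```

Under these assumptions, weights_gb(TOTAL_PARAMS) is roughly 198GB that must live somewhere, but weights_gb(ACTIVE_PARAMS) is only about 8.5GB touched per token — and that gap between disk footprint and per-token working set is exactly what SSD streaming exploits.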
4.4 tokens per second is slow for interactive use, but fine for batch tasks, background summarization, or offline processing. And the technique is not hardware-specific — the SSD-streaming approach should work on any system with fast NVMe and a unified memory architecture. Apple Silicon just happens to be the common consumer hardware that fits.