For 35 years, Arm’s business model was elegant: design the ISA and the microarchitecture, license them to chipmakers, and let everyone else take the manufacturing risk. Qualcomm, Apple, AWS, Ampere, NVIDIA — they all built Arm-licensed CPUs and competed on implementation. Arm collected royalties and stayed out of the fight.
That changed yesterday. Arm announced the AGI CPU, the company’s first production silicon, with Meta as the debut customer and commercial commitments from OpenAI, Cloudflare, Cerebras, SK Telecom, and others. The chip itself is a serious piece of hardware: 136 Neoverse V3 cores at 3.7 GHz boost, dual chiplet on TSMC 3nm, 300W TDP, DDR5-8800 across 12 channels delivering over 800 GB/s of aggregate memory bandwidth, PCIe Gen 6 and CXL 3.0 for accelerator connectivity. A fully populated 36kW air-cooled rack puts 8,160 cores on the network; Supermicro is also building a 200kW liquid-cooled configuration with 45,000+ cores.
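The headline numbers are internally consistent, and the arithmetic is worth making explicit. A quick sanity check (assuming the standard 64-bit, 8-bytes-per-transfer DDR5 channel; sustained bandwidth in practice will come in below this theoretical peak):

```python
# Back-of-envelope check of the announced AGI CPU figures.
channels = 12
transfer_rate = 8800e6     # DDR5-8800: 8,800 MT/s per channel
bytes_per_transfer = 8     # 64-bit channel width (standard DDR5 assumption)

peak_bw = channels * transfer_rate * bytes_per_transfer
print(f"peak memory bandwidth: {peak_bw / 1e9:.1f} GB/s")  # 844.8 GB/s

# The rack figure implies 60 of the 136-core parts per 36kW rack.
cores_per_socket = 136
rack_cores = 8160
print(f"sockets per rack: {rack_cores // cores_per_socket}")  # 60
```

The “over 800 GB/s” claim is therefore the theoretical channel peak, not a measured STREAM number.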
The target isn’t training — it’s what Arm is calling “agentic AI infrastructure,” the CPU-side orchestration that keeps accelerators fed: scheduling work, managing data movement, running the non-GPU parts of inference pipelines. Arm claims 2x performance per rack versus x86, though those numbers are from internal estimates and should be treated accordingly until independent benchmarks land.
The strategic tension here is real. AWS built Graviton because it wanted control over its own compute stack and didn’t want to depend on x86 pricing. Now Arm is effectively doing the same thing, one layer up, and its existing customers are the ones who might feel squeezed. Qualcomm, Apple, Ampere, and AWS have all built data center products on Arm’s IP; they are now watching Arm enter the very market they built. Whether this poisons the licensing relationships, or whether hyperscalers simply treat the AGI CPU as another SKU option, will shape how the next few years play out.
On the same day Arm announced its hardware play, Google Research published a blog post on TurboQuant, a paper being presented at ICLR 2026 that attacks the KV cache problem from the algorithm side. The core idea: compress key-value pairs to approximately 3 bits per channel with no accuracy loss and no fine-tuning required. At 4-bit quantization, attention computation on H100 GPUs runs up to 8x faster than the 32-bit baseline, with at least 6x memory reduction.
The technique combines two sub-algorithms. PolarQuant rotates vectors randomly, converts them to polar coordinates, and exploits the fact that the resulting coordinate distribution becomes highly concentrated — predictable enough to encode efficiently without the overhead that normally makes vector quantization expensive. QJL then applies a 1-bit Johnson-Lindenstrauss correction to remove bias from the quantized attention scores. The combination is validated on LongBench, Needle In A Haystack, and ZeroSCROLLS using Gemma and Mistral, matching full-precision performance at 4x compression.
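To make the rotate-then-polar idea concrete, here is a toy scalar version of the PolarQuant step, under loud assumptions: the bit widths, the uniform codebooks, and the pairing of consecutive coordinates are illustrative choices, not the paper’s design, and the 1-bit QJL debiasing step is omitted entirely. The real method’s efficiency comes from exploiting the concentrated post-rotation angle distribution, which a uniform scalar quantizer does not capture.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def polar_quantize(v, angle_bits=3, radius_bits=4):
    """Toy sketch: rotate, pair consecutive coordinates, store each pair
    as a uniformly quantized (radius, angle) code. Illustrative only."""
    R = random_rotation(len(v))
    w = R @ v
    x, y = w[0::2], w[1::2]
    r, theta = np.hypot(x, y), np.arctan2(y, x)     # theta in [-pi, pi]
    theta_q = np.round((theta + np.pi) / (2 * np.pi) * (2**angle_bits - 1))
    r_max = r.max() + 1e-12
    r_q = np.round(r / r_max * (2**radius_bits - 1))
    return R, theta_q, r_q, r_max

def polar_dequantize(R, theta_q, r_q, r_max, angle_bits=3, radius_bits=4):
    theta = theta_q / (2**angle_bits - 1) * 2 * np.pi - np.pi
    r = r_q / (2**radius_bits - 1) * r_max
    w = np.empty(2 * len(r))
    w[0::2], w[1::2] = r * np.cos(theta), r * np.sin(theta)
    return R.T @ w   # orthogonal, so the transpose undoes the rotation

v = rng.normal(size=128)
v_hat = polar_dequantize(*polar_quantize(v))
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative reconstruction error: {rel_err:.3f}")
```

At (3 + 4) bits per coordinate pair this sketch spends 3.5 bits per channel, in the same regime as the paper’s ~3-bit claim, but its reconstruction error is far from lossless; closing that gap is precisely what the concentrated-distribution encoding and QJL correction are for.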
What distinguishes TurboQuant from prior KV cache work (KIVI, SnapKV, PyramidKV) is the theoretical grounding: the paper proves information-theoretic lower bounds on achievable distortion rate and shows TurboQuant is within a small constant factor of optimal. That’s a stronger claim than most quantization papers make. The immediate application is at Google scale, likely for Gemini. No PyTorch/CUDA implementation is public yet.
At the opposite end of the scale: Hypura, a new open-source project posted to Hacker News today, solves a version of the same memory problem on a MacBook. A 32 GB M1 Max can’t naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes. Hypura’s approach is to treat the Mac’s storage hierarchy as a first-class citizen: it distributes model tensors across GPU shared memory, RAM, and NVMe, and schedules loads based on access patterns rather than just loading what fits.
For MoE models like Mixtral, it exploits the sparsity directly — each token only needs 2 of 8 experts, and temporal locality yields a 99.5% cache hit rate for those expert weights. The result: Mixtral 8x7B runs at 2.2 tokens per second on an M1 Max, and Llama 70B at 0.3 tok/s, where vanilla llama.cpp simply crashes out of memory. Hypura exposes an Ollama-compatible HTTP API, so existing tooling works without changes.
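Why temporal locality buys so much can be shown with a minimal simulation. This is not Hypura’s code — the cache size, the router model, and the 1% switch probability are all assumptions — but it illustrates how a router that reuses the same expert pair for stretches of tokens turns a too-small memory into a high hit rate:

```python
from collections import OrderedDict
import random

class ExpertCache:
    """Tiny LRU cache modeling expert-weight residency (illustrative
    simulation only; not Hypura's actual implementation)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
            self.hits += 1
        else:
            self.misses += 1                    # would trigger an NVMe load
            self.cache[expert_id] = True
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least-recently-used

random.seed(0)
cache = ExpertCache(capacity=6)   # assume RAM holds 6 of the 8 experts
active = [0, 1]                   # router picks 2 of 8 experts per token
for _ in range(10_000):
    if random.random() < 0.01:    # assumed rate of expert-pair switches
        active = random.sample(range(8), 2)
    for e in active:
        cache.fetch(e)

hit_rate = cache.hits / (cache.hits + cache.misses)
print(f"hit rate: {hit_rate:.3%}")
```

Even with room for only 6 of 8 experts, the simulated hit rate lands north of 99% — the same mechanism behind the 99.5% figure, since an NVMe load is only paid when the router switches to a recently evicted expert.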
These three things — a server CPU designed from scratch for agentic infrastructure, an algorithm that proves near-optimal compression for KV caches, and a scheduler that runs 70B models off NVMe on consumer hardware — share a common root problem. Memory bandwidth is the binding constraint for AI inference at every scale, from Arm’s 800 GB/s server chip to the 6.4 GB/s of NVMe on a MacBook Pro. The approaches differ by roughly five orders of magnitude in scale, but the people working on each of them are pulling on the same rope.