Two architecture papers landed on the Hacker News front page this week, independently, making the same underlying argument: the transformer as we’ve shipped it for the past few years has accumulated a lot of unexamined assumptions, and those assumptions are starting to cost us.
Things we do by habit
Residual connections, the skip connections that add a layer’s input back to its output, have been part of the deep learning toolkit since ResNet. In transformers they’re treated as essentially free infrastructure. You don’t really think about them; they’re just there.
The Kimi team at MoonshotAI published Attention Residuals this week (code) asking what happens if you replace those fixed, uniform additions with something learned. Their answer: let each layer selectively attend over all earlier layer outputs rather than simply accumulating them with unit weights. The Block AttnRes variant keeps memory overhead reasonable by grouping layers and attending over block-level summaries instead of individual layers.
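To make the distinction concrete, here is a toy sketch of the idea, not the paper's actual parameterization: a standard residual adds the layer input back with weight 1, while a learned "attention residual" lets the current hidden state compute softmax weights over all earlier layer outputs and mix them in. The projection matrices `W_q` and `W_k` are hypothetical placeholders for whatever learned parameters the real method uses.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def standard_residual(x, layer_fn):
    # The habit: add the layer input back with a fixed unit weight.
    return x + layer_fn(x)

def attention_residual(h, history, W_q, W_k):
    # Sketch of a learned residual: the current state attends over
    # ALL earlier layer outputs and aggregates them with learned,
    # depth-aware weights instead of uniform unit additions.
    # history: list of earlier layer outputs, each of shape (d,)
    H = np.stack(history)                     # (depth_so_far, d)
    q = W_q @ h                               # query from current state
    scores = H @ W_k.T @ q / np.sqrt(len(q))  # one score per earlier layer
    weights = softmax(scores)                 # depth-wise attention weights
    return h + weights @ H                    # weighted sum over history
```

The mechanism also suggests why the gains concentrate on retrieval-heavy tasks: a layer that needs an early representation can up-weight it directly instead of hoping it survived every intermediate addition.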
The results are modest but legible: equivalent performance to a baseline trained on 1.25x more compute. The gains concentrate on reasoning and code, with +7.5 percentage points on GPQA-Diamond and +3.1 percentage points on HumanEval. This makes sense. Tasks that need to retrieve information distributed across many earlier representations benefit most from selective, depth-aware aggregation. It’s a small intervention with a clear mechanism, which is rarer than it sounds in this field.
The second paper is Mamba-3, out of Carnegie Mellon, Princeton, and Together AI, accepted at ICLR 2026 (OpenReview). State space models have been perpetually almost-relevant for a couple of years. Mamba-3 takes a different design bet from its predecessors: instead of optimizing for training speed (Mamba-2’s main pitch), it targets inference efficiency. Complex-valued state tracking, a multi-input multi-output (MIMO) decoding structure, and a new discretization scheme combine to beat the transformer baseline by about 4% on language modeling while running up to 7x faster at long sequences. The benchmarks are specific and the paper is a peer-reviewed ICLR acceptance, not a marketing post.
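Why an SSM can be fast at long sequences is worth spelling out. A minimal diagonal state-space recurrence, which omits Mamba-3's complex-valued states, MIMO structure, and discretization scheme entirely and uses toy placeholder parameters `A`, `B`, `C`, still shows the core property: decoding advances a fixed-size state, so per-token cost is constant in sequence length, unlike a transformer's KV cache, which grows with context.

```python
import numpy as np

def ssm_step(state, x_t, A, B, C):
    # One decode step of a diagonal linear state-space model:
    # the entire memory of the sequence lives in `state`, whose
    # size is fixed no matter how many tokens came before.
    state = A * state + B * x_t   # elementwise decay + input injection
    y_t = C @ state               # linear readout
    return state, y_t

def ssm_decode(xs, A, B, C):
    # Constant-memory autoregressive decoding: no cache that grows
    # with sequence length, just the recurrent state.
    state = np.zeros_like(B)
    ys = []
    for x_t in xs:
        state, y_t = ssm_step(state, x_t, A, B, C)
        ys.append(y_t)
    return np.array(ys)
```

This is a sketch of the family, not of Mamba-3 itself; the paper's contribution is precisely in how `A`, `B`, and `C` are parameterized and discretized so that this recurrence stays competitive with attention on quality.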
Whether either of these becomes load-bearing infrastructure is genuinely uncertain, and architecture research has a long history of promising results that don’t survive scaling. But two papers in one week, both poking at different parts of the standard transformer stack, is a signal worth paying attention to.
The model that fooled everyone
On March 11, a model called Hunter Alpha appeared on OpenRouter with no attribution: no company, no paper, no press release. According to Xiaomi and subsequent reporting, it topped usage charts almost immediately and processed over a trillion tokens in its anonymous run. The AI internet quickly concluded it had to be DeepSeek V4: the chain-of-thought patterns matched, the knowledge cutoff matched, even the self-description as “a Chinese AI model” fit.
It wasn’t DeepSeek. Xiaomi confirmed on March 18 that Hunter Alpha was an internal test build of MiMo-V2-Pro, a trillion-parameter model with 42 billion active parameters, built by a team led by former DeepSeek researcher Luo Fuli (report). Today the company announced a full $8.7 billion commitment to AI infrastructure behind it (Caixin).
The model’s benchmarks are credible, with Xiaomi citing a top-10 global placement on the Artificial Analysis Intelligence Index at launch, and its agent-first positioning is more interesting than another chat model. The stealth launch strategy worked well, possibly not entirely by design: the confusion with DeepSeek meant the model got real usage and real scrutiny before anyone knew whose reputation was on the line. For an AI division that’s less than two years old, being genuinely mistaken for the frontier Chinese lab is a reasonable outcome.
A quiet note on translation
Less flashy but worth including: Meta published Omnilingual MT this week (paper), extending machine translation support to over 1,600 languages. The predecessor system covered about 200. Specialized 1B to 8B models match or exceed 70B general-purpose LLMs on translation quality. This is a clean example of why task specialization still beats raw scale for constrained problems.
The part that matters most isn’t the headline number. Meta is also releasing the evaluation infrastructure: BOUQuET, BLASER 3, OmniTOX. Low-resource language MT has historically suffered from a lack of good benchmarks as much as a lack of models. If the eval tooling holds up, this work has a realistic path to improving translation for languages that current systems handle poorly or not at all.