New Soft Mixture-of-Experts Model Sets New Benchmarks for Image Classification

A new paper from researchers at Google DeepMind proposes Soft Mixture-of-Experts (Soft MoE), a fully differentiable sparse Transformer architecture for image classification. The paper reports that Soft MoE outperforms both standard Vision Transformers (ViTs) and popular sparse mixture-of-experts methods on image classification benchmarks.

Mixture-of-experts models aim to increase model capacity without a proportional increase in computational cost by routing different input tokens through different expert modules. However, most prior work relies on discrete routing algorithms, which assign each token to a small number of experts and can be unstable to train and inefficient to run.

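The key innovation in Soft MoE is a differentiable routing algorithm. Instead of assigning each token to an expert, the layer mixes all input tokens into weighted combinations (called slots) and passes those slots to the experts. Because every weight is a smooth function of the inputs, this soft routing sidesteps the optimization challenges of discrete routing.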
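
To make the routing step concrete, here is a minimal sketch of a single soft-routed layer in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy scaling experts, and the expert-major ordering of slots are all hypothetical, and a practical version would use an array library such as JAX. The sketch uses one softmax over tokens to build each slot and a second softmax over slots to build each output token.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def soft_moe_layer(tokens, phi, experts, slots_per_expert):
    """Apply one soft-routed mixture-of-experts layer (illustrative only).

    tokens:           m x d input tokens
    phi:              d x (n * p) learnable slot parameters
    experts:          list of n callables, each mapping a d-vector to a d-vector
    slots_per_expert: p, the number of slots each expert processes
    """
    logits = matmul(tokens, phi)  # m x (n * p)
    # Dispatch weights: softmax over tokens, one distribution per slot.
    dispatch_t = [softmax(col) for col in transpose(logits)]  # (n * p) x m
    # Combine weights: softmax over slots, one distribution per token.
    combine = [softmax(row) for row in logits]  # m x (n * p)
    # Each slot is a weighted average of all input tokens.
    slot_inputs = matmul(dispatch_t, tokens)  # (n * p) x d
    # Assumption: slots are grouped by expert, p consecutive slots each.
    slot_outputs = [
        experts[i // slots_per_expert](slot) for i, slot in enumerate(slot_inputs)
    ]
    # Each output token is a weighted average of all slot outputs.
    return matmul(combine, slot_outputs)  # m x d


def make_scaling_expert(scale):
    """Hypothetical stand-in for an expert MLP: scales its input vector."""
    return lambda vector: [scale * x for x in vector]


# Toy usage: 6 tokens of width 4, 3 experts with 2 slots each.
random.seed(0)
m, d, n, p = 6, 4, 3, 2
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
phi = [[random.gauss(0, 1) for _ in range(n * p)] for _ in range(d)]
experts = [make_scaling_expert(s) for s in (0.5, 1.0, 2.0)]
output = soft_moe_layer(tokens, phi, experts, slots_per_expert=p)
```

Nothing in this sketch selects, sorts, or drops tokens, so every operation is differentiable. The cost of the expert step depends on the number of slots rather than the number of experts, which is how capacity can grow without a matching growth in compute.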

Experiments demonstrate Soft MoE’s capabilities:

Soft MoE models dominate Vision Transformers and other sparse methods on the Pareto frontier of performance versus training cost. For example, Soft MoE-B/16 outperforms ViT-L/16 while requiring 3x fewer FLOPs.
When matched for training time, Soft MoE-B/16 surpasses ViT-H/14 on upstream (pre-training) metrics while being 5.7x faster at inference.
Soft MoE-L/16 beats ViT-H/14 upstream while having 3x lower inference cost. The largest Soft MoE models substantially improve over all ViTs.
The benefits carry over to multimodal settings. A frozen Soft MoE image model paired with a text tower outperforms its ViT counterparts on image-text retrieval.

The results suggest that the continuous routing in Soft MoE allows sparse Transformers to scale beyond what was practical with prior discrete algorithms. The approach also simplifies training, since it avoids the instability associated with discrete routing.

By improving efficiency, Soft MoE could broaden the adoption of very large multimodal models. The ability to serve large models cheaply after pre-training may prove particularly impactful.

Despite this promise, challenges remain before Soft MoE sees large-scale deployment. The reliance on model parallelism introduces complexity, and the approach does not directly apply to the auto-regressive decoding tasks common in NLP. Nonetheless, Soft MoE represents a major advance that sets a new state of the art for sparse Transformers.
