DEJAVU: 6x faster transformer inference

In the YouTube video “Sparse LLMs at inference: 6x faster transformers! | DEJAVU paper explained,” the speaker walks through the Deja Vu paper, which introduces a method for making Large Language Models (LLMs) sparse at inference time without sacrificing quality. Transformer-based LLMs are slow to serve because of their expensive self-attention and MLP layers. Sparsity is a natural way to cut that cost, but the challenge is that modern hardware is optimized for dense computation, so static sparsity rarely turns into real speedups.

The authors therefore propose contextual sparsity: for each input, only a small, input-dependent subset of attention heads and MLP neurons is actually needed, so the remaining components can be switched off on the fly while preserving the model’s in-context learning ability. Exploiting this yields significant runtime savings; for OPT-175B, Deja Vu reports inference about 6 times faster than the standard transformer implementation.

The video also covers the paper’s observation of contextual sparsity inside the attention blocks, explained through an analogy to mean-shift clustering: each self-attention head performs something like one mean-shift step, pulling token representations toward one another, and a few “heavy-hitter” heads end up carrying the important interactions. These heavy hitters display contextual sparsity because they cater to what a specific input needs, and the mean-shift view explains why: denser regions gain more weight, form stronger bonds, and accumulate higher attention scores.
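To make the contextual-sparsity idea concrete, here is a minimal PyTorch sketch of one sparse MLP block for a single token. The class name, dimensions, and the low-rank predictor are illustrative assumptions; this is not the paper’s implementation, which trains its sparsity predictors and relies on fused sparse kernels to realize the actual speedup.

```python
import torch
import torch.nn as nn


class ContextualSparseMLP(nn.Module):
    """Toy sketch of contextual sparsity for one transformer MLP block.

    A cheap low-rank predictor looks at the current hidden state and guesses
    which MLP neurons will matter; only those k neurons are computed.
    Illustration only -- Deja Vu trains its predictors and uses optimized
    sparse kernels, neither of which is done here.
    """

    def __init__(self, d_model=1024, d_ff=4096, d_low=64, k=512):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # up-projection
        self.w2 = nn.Linear(d_ff, d_model)   # down-projection
        self.act = nn.GELU()
        # Low-rank scorer: hidden state -> one score per MLP neuron.
        self.predictor = nn.Sequential(nn.Linear(d_model, d_low),
                                       nn.Linear(d_low, d_ff))
        self.k = k

    def forward(self, x):
        # x: (d_model,) -- a single token's hidden state, as in decoding.
        idx = self.predictor(x).topk(self.k).indices  # predicted "active" neurons
        # Compute only the selected rows/columns of the dense weights.
        h = self.act(x @ self.w1.weight[idx].T + self.w1.bias[idx])  # (k,)
        return h @ self.w2.weight[:, idx].T + self.w2.bias           # (d_model,)


if __name__ == "__main__":
    mlp = ContextualSparseMLP()
    out = mlp(torch.randn(1024))
    print(out.shape)  # torch.Size([1024])
```

With k much smaller than d_ff, the block touches only a fraction of the MLP weights per token, which is where the runtime savings come from once the sparse computation is implemented efficiently.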

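The mean-shift analogy can also be written out. The sketch below ignores the 1/√d scaling, multi-head structure, and output projection, and is only meant to show the shape of the correspondence described in the video.

```latex
% One attention step for token x_i, with kernel
% K(x_i, x_j) = exp(x_i^T W_Q W_K^T x_j):
\[
  \mathrm{Attn}(x_i) \;=\; \sum_j \frac{K(x_i, x_j)}{\sum_{j'} K(x_i, x_{j'})}\, W_V x_j ,
\]
% which has the same form as one mean-shift update,
\[
  x_i \;\leftarrow\; \frac{\sum_j K(x_i, x_j)\, x_j}{\sum_{j'} K(x_i, x_{j'})} .
\]
```

In this reading, the query/key projections define the similarity kernel and the value projection determines how tokens get pushed together; tokens sitting in denser regions receive more kernel mass, which is why a handful of heavy-hitter heads accumulate most of the attention score for a given input.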