Simplifying Vision Transformers with ReLU Attention

A new paper from researchers at DeepMind explores replacing the softmax function in transformer attention with a ReLU activation. The work shows that this simple change allows vision transformers to achieve performance comparable to traditional softmax attention on image classification tasks.

Transformers have become a dominant architecture across natural language processing and computer vision. At their core is an attention mechanism which builds contextual representations by aggregating information from the entire input sequence. This attention uses a softmax normalization to compute a probability distribution over sequence elements.
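
For reference, here is a minimal NumPy sketch of single-head softmax attention (the function name and shapes are our own illustration, not code from the paper):

```python
import numpy as np

def softmax_attention(q, k, v):
    """Single-head scaled dot-product attention with softmax normalization.

    q, k, v: arrays of shape (seq_len, d_head).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # probability distribution over sequence elements
    return weights @ v                              # aggregate values across the sequence
```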

While effective, the softmax has a notable computational downside. It requires an exponentiation and a normalizing sum over the sequence length in every attention layer, and because each attention weight depends on that global sum, the computation is difficult to parallelize along the sequence dimension, which limits scalability.

In the paper, the authors experiment with replacing the softmax with a pointwise ReLU activation. Critically, they divide the ReLU output by the sequence length, which keeps the expected magnitude of the attention weights on roughly the same scale as softmax weights and avoids the need to retune other hyperparameters.
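
The change itself is small. Here is a minimal sketch under the same toy setup as above (again our own code, not the authors' implementation): the softmax is replaced by a pointwise ReLU followed by division by the sequence length.

```python
import numpy as np

def relu_attention(q, k, v):
    """ReLU attention: pointwise ReLU on the scaled scores, divided by the
    sequence length instead of being normalized with a softmax.

    q, k, v: arrays of shape (seq_len, d_head).
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                # same scaled dot-product scores as before
    weights = np.maximum(scores, 0.0) / seq_len  # ReLU, then divide by sequence length
    return weights @ v
```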

Figure: results for small to large vision transformers trained on ImageNet-21k for 30 epochs. The plotted metric is ImageNet-1k accuracy, obtained for ImageNet-21k models by taking the top class among those that appear in ImageNet-1k, without fine-tuning.

Experiments on vision transformers from small to large scales indicate that ReLU attention matches both the accuracy and the scaling trends of softmax attention. The authors also report strong performance on several transfer learning benchmarks, which suggests that the representations learned with ReLU attention transfer broadly.

Because it removes the normalization across the sequence, ReLU attention allows the attention computation to be parallelized trivially over the sequence length. As models and sequence lengths continue to grow, this property will become increasingly important for reducing memory costs and keeping training feasible.
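
To make the parallelization point concrete, here is a toy sketch (our own, not the paper's implementation) that splits the keys and values into chunks. Because no normalization runs over the full sequence, each chunk's contribution is independent and the partial results can simply be summed, so the chunks could be processed on different devices.

```python
import numpy as np

def relu_attention_chunked(q, k, v, chunk_size=128):
    """ReLU attention computed over independent key/value chunks.

    Each iteration of the loop depends only on its own chunk, so the work
    could be distributed across devices and the partial outputs added together.
    Softmax attention would first need the global normalizing sum.
    """
    seq_len, d = q.shape
    out = np.zeros((seq_len, v.shape[-1]))
    for start in range(0, seq_len, chunk_size):  # chunks are mutually independent
        k_c = k[start:start + chunk_size]
        v_c = v[start:start + chunk_size]
        scores = q @ k_c.T / np.sqrt(d)
        out += (np.maximum(scores, 0.0) / seq_len) @ v_c
    return out
```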

Overall, this work shows a promising path to simplifying and speeding up transformer architectures. It will be interesting to see if ReLU attention can become a default across NLP and computer vision models in the future. The findings also motivate further research into what components of attention are truly essential for strong performance.
