2021
DOI: 10.48550/arxiv.2105.02723
Preprint
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Luke Melas-Kyriazi

Abstract: The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed…
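As a rough illustration of the replacement described in the abstract, the sketch below (PyTorch-style Python, an assumption rather than the author's released code) applies a single linear layer over the patch dimension in place of self-attention; the class name and hyperparameters are illustrative only.

```python
# Minimal sketch: replace the self-attention sublayer of a ViT block with a
# feed-forward (linear) layer applied over the patch dimension.
# `FeedForwardOverPatches` and its hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class FeedForwardOverPatches(nn.Module):
    """Mixes information across patches with a single linear layer."""
    def __init__(self, num_patches: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)            # normalize each token's features
        self.fc = nn.Linear(num_patches, num_patches)   # acts on the patch (token) axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, num_patches, hidden_dim)
        y = self.norm(x).transpose(1, 2)     # -> (batch, hidden_dim, num_patches)
        y = self.fc(y).transpose(1, 2)       # mix across patches, then restore layout
        return x + y                         # residual connection, as in a ViT block
```

Stacking such patch-mixing blocks, interleaved with the usual per-patch feed-forward MLP of a transformer block, yields an architecture built only from feed-forward layers, in line with the abstract's description.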

Cited by 39 publications (65 citation statements)
References 7 publications
“…In this section, we review in detail the structure of the latest so-called pioneering MLP model, MLP-Mixer [36], followed by a brief review of the contemporaneous ResMLP [40] and Feed-forward [37]. After that, we strip the new paradigm, MLP, from the network and elaborate on its differences from and connections with convolution and self-attention mechanisms.…”
Section: Pioneering Model and New Paradigm
confidence: 99%
“…Here σ is an element-wise nonlinearity (GELU [79]), and LayerNorm(·) denotes the layer normalization [48] widely used in Transformer-based models. W₃ ∈ ℝ^{rC×C} represents the weights of a fully-connected layer that increases the feature dimension. Compared to MLP-Mixer, Feed-forward (FF) [37] and ResMLP [40] were posted on arXiv a few days later. Feed-forward [37] adopts essentially the same structure as MLP-Mixer, merely swapping the order of the channel-mixing MLP and the token-mixing MLP, so it is not repeated here.…”
Section: Structure of Pioneering Model
confidence: 99%
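For concreteness, the following sketch shows a channel-mixing MLP of the kind the quoted passage describes, with LayerNorm, a GELU nonlinearity σ, and an expansion weight W₃ of shape (rC, C); the class name and the default ratio r = 4 are assumptions for illustration, not values taken from the cited papers.

```python
# Minimal PyTorch sketch of the channel-mixing MLP from the quoted passage:
# LayerNorm [48], a GELU nonlinearity (sigma), and a weight W3 of shape (r*C, C)
# that expands the channel dimension. Class name and r = 4 are assumptions.
import torch
import torch.nn as nn

class ChannelMixingMLP(nn.Module):
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.expand = nn.Linear(channels, r * channels)   # weight W3 in R^{rC x C}
        self.act = nn.GELU()                              # element-wise sigma
        self.reduce = nn.Linear(r * channels, channels)   # project back to C channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, channels); the MLP acts on each patch independently
        return x + self.reduce(self.act(self.expand(self.norm(x))))
```

Swapping the order in which this channel-mixing block and a patch-mixing block (as sketched earlier) are applied is the only structural difference the passage notes between Feed-forward [37] and MLP-Mixer [36].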