2021
DOI: 10.48550/arxiv.2111.11418
Preprint

MetaFormer Is Actually What You Need for Vision

Abstract: Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, …
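The hypothesis in the abstract can be made concrete with a minimal sketch of the MetaFormer abstraction: a block that normalizes, applies some pluggable token mixer with a residual connection, and then applies a channel MLP with another residual. The pooling mixer below follows the PoolFormer idea of average pooling with the input subtracted; the module names, normalization choice, and hyper-parameters are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a MetaFormer block (illustrative; not the authors' reference code).
import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """Token mixer that summarizes local context with average pooling (PoolFormer-style)."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W)
        return self.pool(x) - x    # subtract the input; the outer residual adds it back

class MetaFormerBlock(nn.Module):
    """General MetaFormer block: Norm -> token mixer -> residual, Norm -> channel MLP -> residual."""
    def __init__(self, dim, mlp_ratio=4, token_mixer=None):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)              # channel-wise norm for (B, C, H, W) tensors
        self.token_mixer = token_mixer or PoolingTokenMixer()
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(                      # channel MLP implemented with 1x1 convolutions
            nn.Conv2d(dim, hidden, 1), nn.GELU(), nn.Conv2d(hidden, dim, 1))

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

# Any token mixer (pooling, attention, spatial MLP) can be dropped in unchanged.
block = MetaFormerBlock(dim=64)
out = block(torch.randn(2, 64, 56, 56))
print(out.shape)  # torch.Size([2, 64, 56, 56])
```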

Cited by 32 publications (59 citation statements)
References 48 publications
“…These differences in mechanism lead to significant improvement of efficiency and performance as well. Another closely related work is Poolformer [74] which uses a pooling to summarize the local context and a simple subtraction to adjust the individual inputs. Though achieving decent efficiency for its simplicity, Poolformer lags behind popular vision transformers like Swin on performance.…”
Section: Related Work (mentioning, confidence 99%)
“…( 4) is motivated by its desirable properties. Compared to pooling [74,29], depth-wise convolution is learnable and structure-aware. In contrast to regular convolution, it is channel-wise and thus computationally much cheaper.…”
Section: Context Aggregation Via M (mentioning, confidence 99%)
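For context, the two aggregators contrasted in this excerpt, parameter-free pooling versus learnable depth-wise convolution, can be compared in a few lines; the channel count and kernel size below are arbitrary assumptions chosen only to illustrate the parameter-count difference against a regular convolution.

```python
# Pooling has no parameters; depth-wise convolution is learnable but channel-wise (groups=dim),
# so it needs dim * k * k weights instead of dim * dim * k * k for a regular convolution.
import torch.nn as nn

dim, k = 64, 7
pool_mixer = nn.AvgPool2d(k, stride=1, padding=k // 2)                    # parameter-free
dw_conv = nn.Conv2d(dim, dim, kernel_size=k, padding=k // 2, groups=dim)  # depth-wise
regular_conv = nn.Conv2d(dim, dim, kernel_size=k, padding=k // 2)         # regular

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(pool_mixer), count(dw_conv), count(regular_conv))  # 0 3200 200768
```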
“…We evaluate the effectiveness of Mugs through ViT [23,39] and thus will take ViT as an example to introduce Mugs. This is because 1) with similar model size, ViT shows better performance than CNN [49,39,28,58]; 2) ViT shows great potential for unifying the vision and language models [28,2].…”
Section: Multi-granular Self-supervised Learning (mentioning, confidence 99%)
“…Transformer models [48], [49], [50], [51] have recently achieved excellent performance on a wide range of language and computer vision tasks, e.g., machine translation [52], image recognition [53], video understanding [54], visual question answering [55], etc. Generally, the success of Transformer can be attributed to its selfsupervision and self-attention [50].…”
Section: Transformer Model (mentioning, confidence 99%)