2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01055
MetaFormer is Actually What You Need for Vision

Cited by 596 publications (249 citation statements)
References 25 publications
“…When equipped with the RetinaNet 1x (Lin et al., 2017b) setting, ConvFormer-S achieves gains. Table 5: Results of semantic segmentation on the ADE20K (Zhou et al., 2019) validation set. The pre-trained ConvFormer-S is plugged into the Semantic FPN (Kirillov et al., 2019a) and UperNet (Xiao et al., 2018) frameworks, and the training/validation schemes follow (Yu et al., 2021) and (Liu et al., 2021c). Note that the difference in the number of parameters between the two ConvFormer-S variants is due to the use of Semantic FPN vs. UperNet.…”
Section: Results (citation type: mentioning; confidence: 99%)
“…We follow the main training strategies in PoolFormer (Yu et al., 2021). Specifically, CutMix (Yun et al., 2019), RandAugment (Cubuk et al., 2020), Mixup (Zhang et al., 2018), and Label Smoothing regularization (Szegedy et al., 2016) are applied to augment the training data.…”
Section: Methods (citation type: mentioning; confidence: 99%)
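The quoted passage lists the standard augmentation recipe reused from PoolFormer. As a hedged illustration (not the authors' code), the sketch below implements two of those components from their cited definitions: Mixup (Zhang et al., 2018) blends a pair of samples and their one-hot labels with a Beta-sampled weight, and label smoothing (Szegedy et al., 2016) softens hard 0/1 targets; CutMix and RandAugment follow similar compositional patterns.

```python
# Minimal sketch of Mixup and label smoothing, assuming numpy arrays for
# images and one-hot label vectors. Illustrative only, not the paper's code.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def smooth_labels(one_hot, eps=0.1):
    """Give the true class weight (1 - eps) plus eps/K spread uniformly
    over all K classes, so the target still sums to 1."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k
```

In practice these are applied per batch during training; libraries such as timm bundle the full recipe, but the arithmetic is no more than what is shown here.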
“…Although Jet Image uses CNNs, in this study we use a relatively new model, gated MLP (gMLP) [5], because the performance of newer models such as Vision Transformer (ViT) [6] and MLP-Mixer [7] has improved significantly over the past few years. These new models are based on a similar structure called the MetaFormer [8].…”
Section: Related Work (citation type: mentioning; confidence: 99%)
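The "similar structure" the last citation refers to is the MetaFormer abstraction from the paper above: a residual block where the token-mixing component is pluggable (attention in ViT, spatial MLP in MLP-Mixer/gMLP, pooling in PoolFormer). A minimal sketch of one such block, with the mixer, MLP, and norm passed in as plain callables (an illustrative simplification, not the paper's implementation):

```python
# MetaFormer block sketch: pre-norm residual structure with a pluggable
# token mixer. token_mixer, mlp, and norm are arbitrary callables here;
# in real models they are learned layers (attention/MLP/pooling, etc.).
import numpy as np

def metaformer_block(x, token_mixer, mlp, norm):
    x = x + token_mixer(norm(x))  # sub-block 1: mix information across tokens
    x = x + mlp(norm(x))          # sub-block 2: channel-wise MLP
    return x
```

Swapping `token_mixer` between self-attention, a spatial MLP, or average pooling recovers the different model families named in the quote, which is the paper's central claim: the shared block structure, not the specific mixer, drives much of the performance.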