2021
DOI: 10.48550/arxiv.2107.06263
Preprint

CMT: Convolutional Neural Networks Meet Vision Transformers

Abstract: Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only the canonical transformers, but also the high-performance convolutional models. We propose a new transformer based hybrid ne…

Cited by 51 publications (67 citation statements)
References 63 publications
“…Many efforts have been made to incorporate features of convolutional networks into vision transformers and vice versa. Self-attention can emulate convolution (Cordonnier et al, 2019) and can be initialized or regularized to be like it (d'Ascoli et al, 2021); other works simply add convolution operations to transformers (Dai et al, 2021; Guo et al, 2021), or include downsampling to be more like traditional pyramid-shaped convolutional networks. Conversely, self-attention or attention-like operations can supplement or replace convolution in ResNet-style models (Ramachandran et al, 2019; Bello, 2021).…”
Section: Related Work (mentioning)
confidence: 99%
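The "add convolution operations to transformers" idea referenced in this excerpt (the line of work CMT belongs to) can be made concrete with a small sketch. The following PyTorch block is a minimal illustration, not the actual CMT module: the class name `HybridBlock` and all hyperparameters are placeholders, and it simply prepends a depthwise 3x3 convolution (local, CNN-like bias) to a standard attention + MLP block (global modelling).

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Illustrative transformer block with an added depthwise-convolution branch.

    A rough sketch of the 'convolution inside a transformer block' idea;
    not the exact CMT design.
    """
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depthwise 3x3 convolution injects local (CNN-like) inductive bias.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) token sequence laid out on an h x w grid.
        b, n, c = x.shape
        # Local enhancement: reshape tokens to a feature map and apply depthwise conv.
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        x = x + self.dwconv(feat).flatten(2).transpose(1, 2)
        # Global modelling: standard self-attention and MLP with residual connections.
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```

For a 14x14 token grid with 64-dimensional embeddings, `HybridBlock(64)(x, 14, 14)` maps a `(batch, 196, 64)` tensor to a tensor of the same shape.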
“…Recently, the pioneering work ViT [22] successfully applies a pure transformer-based architecture to computer vision, revealing the potential of transformers for handling visual tasks. Many follow-up studies have been proposed [4,5,9,12,18,21,23,24,27-29,31,38,41,43,45,50,52,56,76,77,80,81,84]. Many of them analyze ViT [15,17,26,32,44,55,69,73,75,82] and improve it by introducing locality to earlier layers [11,17,48,64,79,83,87].…”
Section: Related Work (mentioning)
confidence: 99%
“…Specifically, DeiT (Touvron et al 2020) adopts several training techniques (e.g. truncated normal initialization, strong data augmentation and smaller weight decay) and uses distillation to extend ViT to a data-efficient version; T2T-ViT (Yuan et al 2021b), CeiT (Yuan et al 2021a), and CvT (Wu et al 2021) try to deal with the rigid patch division by introducing convolution operations for patch sequence generation to facilitate training; DeepViT (Zhou et al 2021a) […] (Heo et al 2021), CeiT (Yuan et al 2021a), LocalViT (Li et al 2021b) and Visformer (Chen et al 2021b) introduce convolutional bias to speed up training; LV-ViT (Jiang et al 2021) adopts several techniques, including MixToken and Token Labeling, for better training and feature generation; the SAM optimizer (Foret et al 2020) is adopted in (Chen, Hsieh, and Gong 2021) to better train ViTs without strong data augmentation; KVT (Wang et al 2021a) introduces k-NN attention to filter out irrelevant tokens and speed up training; a conv-stem is adopted in several works (Graham et al 2021; Xiao et al 2021; Guo et al 2021; Yuan et al 2021c) to improve the robustness of training ViTs. In this paper, we investigate the training of ViTs using a conv-stem and demonstrate several properties of the conv-stem in the context of vision transformers, both theoretically and empirically.…”
Section: Related Work (mentioning)
confidence: 99%
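The k-NN attention mentioned for KVT can be understood as keeping only the top-k attention scores per query before the softmax. The snippet below is a hedged, single-head PyTorch illustration of that idea; the function name and masking details are a simplification of ours, not the KVT implementation.

```python
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, topk=32):
    """Single-head k-NN attention sketch: each query attends only to its
    top-k most similar keys, masking the remaining keys before softmax.

    q, k, v: (batch, seq_len, dim) tensors. Illustrative, not the KVT code.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (batch, seq, seq)
    topk = min(topk, scores.shape[-1])
    # Threshold at each query's k-th largest score; everything below is masked out.
    kth = scores.topk(topk, dim=-1).values[..., -1:].expand_as(scores)
    masked = scores.masked_fill(scores < kth, float("-inf"))
    return torch.matmul(F.softmax(masked, dim=-1), v)
```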
“…The reasons for this training difficulty are empirically analysed in (Xiao et al 2021), where the authors conjecture that the issue lies with the patchify stem of ViT models and propose that early convolutions help transformers see better. Recent works (Graham et al 2021; Guo et al 2021; Yuan et al 2021c) also introduce a conv-stem to improve the robustness of training vision transformers, but they lack a deep analysis of why such a conv-stem works.…”
Section: Introduction (mentioning)
confidence: 99%
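To make the conv-stem idea discussed in these excerpts concrete, here is a minimal PyTorch sketch that swaps ViT's single 16x16 patchify projection for a stack of stride-2 3x3 convolutions. The channel widths, depth, and normalization choices are illustrative assumptions, not the configuration used in any of the cited papers.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Illustrative convolutional stem replacing ViT's 16x16 'patchify' projection.

    Channel widths and depth are placeholder choices, not taken from the cited works.
    """
    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        dims = [in_chans, 48, 96, 192, embed_dim]
        layers = []
        for i in range(4):  # four stride-2 convs give the same overall 16x downsampling
            layers += [
                nn.Conv2d(dims[i], dims[i + 1], kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(dims[i + 1]),
                nn.ReLU(inplace=True),
            ]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, 3, 224, 224) -> tokens: (batch, 196, embed_dim)
        return self.stem(x).flatten(2).transpose(1, 2)
```

Because the stack downsamples by 16x overall (a 224x224 input yields a 14x14 grid, i.e. 196 tokens), it is a drop-in replacement for the patchify stem while adding the early local processing that these excerpts associate with more robust ViT training.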