2021
DOI: 10.48550/arxiv.2107.06263
Preprint

CMT: Convolutional Neural Networks Meet Vision Transformers

Abstract: Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only the canonical transformers, but also the high-performance convolutional models. We propose a new transformer based hybrid ne…

Cited by 51 publications (67 citation statements)
References 63 publications
“…Many efforts have been made to incorporate features of convolutional networks into vision transformers and vice versa. Self-attention can emulate convolution (Cordonnier et al, 2019) and can be initialized or regularized to be like it (d'Ascoli et al, 2021); other works simply add convolution operations to transformers (Dai et al, 2021; Guo et al, 2021), or include downsampling to be more like traditional pyramid-shaped convolutional networks. Conversely, self-attention or attention-like operations can supplement or replace convolution in ResNet-style models (Ramachandran et al, 2019; Bello, 2021).…”
Section: Related Work (mentioning)
confidence: 99%
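The "add convolution operations to transformers" idea referenced in this excerpt (the line of work CMT belongs to) can be made concrete with a small sketch. The following PyTorch block is a minimal illustration, not the actual CMT module: the class name `HybridBlock` and all hyperparameters are placeholders, and it simply prepends a depthwise 3x3 convolution (local, CNN-like bias) to a standard attention + MLP block (global modelling).

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Illustrative transformer block with an added depthwise-convolution branch.

    A rough sketch of the 'convolution inside a transformer block' idea;
    not the exact CMT design.
    """
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depthwise 3x3 convolution injects local (CNN-like) inductive bias.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) token sequence laid out on an h x w grid.
        b, n, c = x.shape
        # Local enhancement: reshape tokens to a feature map and apply depthwise conv.
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        x = x + self.dwconv(feat).flatten(2).transpose(1, 2)
        # Global modelling: standard self-attention and MLP with residual connections.
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```

For a 14x14 token grid with 64-dimensional embeddings, `HybridBlock(64)(x, 14, 14)` maps a `(batch, 196, 64)` tensor to a tensor of the same shape.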
“…Recently, the pioneering work ViT [22] successfully applies a pure transformer-based architecture to computer vision, revealing the potential of transformers for handling visual tasks. Many follow-up studies have been proposed [4,5,9,12,18,21,23,24,27-29,31,38,41,43,45,50,52,56,76,77,80,81,84]. Many of them analyze ViT [15,17,26,32,44,55,69,73,75,82] and improve it by introducing locality to earlier layers [11,17,48,64,79,83,87].…”
Section: Related Work (mentioning)
confidence: 99%
“…Specifically, DeiT (Touvron et al 2020) adopts several training techniques (e.g. truncated normal initialization, strong data augmentation and smaller weight decay) and uses distillation to extend ViT to a data-efficient version; T2T-ViT (Yuan et al 2021b), CeiT (Yuan et al 2021a), and CvT (Wu et al 2021) try to deal with the rigid patch division by introducing convolution operations for patch sequence generation to facilitate training; DeepViT (Zhou et al 2021a) […] (Heo et al 2021), CeiT (Yuan et al 2021a), LocalViT (Li et al 2021b) and Visformer (Chen et al 2021b) introduce convolutional bias to speed up training; LV-ViT (Jiang et al 2021) adopts several techniques, including MixToken and Token Labeling, for better training and feature generation; the SAM optimizer (Foret et al 2020) is adopted in (Chen, Hsieh, and Gong 2021) to better train ViTs without strong data augmentation; KVT (Wang et al 2021a) introduces k-NN attention to filter out irrelevant tokens and speed up training; a conv-stem is adopted in several works (Graham et al 2021; Xiao et al 2021; Guo et al 2021; Yuan et al 2021c) to improve the robustness of training ViTs. In this paper, we investigate the training of ViTs using a conv-stem and demonstrate several properties of the conv-stem in the context of vision transformers, both theoretically and empirically.…”
Section: Related Work (mentioning)
confidence: 99%
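The k-NN attention mentioned for KVT can be understood as keeping only the top-k attention scores per query before the softmax. The snippet below is a hedged, single-head PyTorch illustration of that idea; the function name and masking details are a simplification of ours, not the KVT implementation.

```python
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, topk=32):
    """Single-head k-NN attention sketch: each query attends only to its
    top-k most similar keys, masking the remaining keys before softmax.

    q, k, v: (batch, seq_len, dim) tensors. Illustrative, not the KVT code.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (batch, seq, seq)
    topk = min(topk, scores.shape[-1])
    # Threshold at each query's k-th largest score; everything below is masked out.
    kth = scores.topk(topk, dim=-1).values[..., -1:].expand_as(scores)
    masked = scores.masked_fill(scores < kth, float("-inf"))
    return torch.matmul(F.softmax(masked, dim=-1), v)
```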
“…The reasons for this training difficulty are empirically analysed in (Xiao et al 2021), where the authors conjecture that the issue lies with the patchify stem of ViT models and propose that early convolutions help transformers see better. Recent works (Graham et al 2021; Guo et al 2021; Yuan et al 2021c) also introduce a conv-stem to improve the robustness of training vision transformers, but they lack a deep analysis of why such a conv-stem works.…”
Section: Introduction (mentioning)
confidence: 99%
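To make the conv-stem idea discussed in these excerpts concrete, here is a minimal PyTorch sketch that swaps ViT's single 16x16 patchify projection for a stack of stride-2 3x3 convolutions. The channel widths, depth, and normalization choices are illustrative assumptions, not the configuration used in any of the cited papers.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Illustrative convolutional stem replacing ViT's 16x16 'patchify' projection.

    Channel widths and depth are placeholder choices, not taken from the cited works.
    """
    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        dims = [in_chans, 48, 96, 192, embed_dim]
        layers = []
        for i in range(4):  # four stride-2 convs give the same overall 16x downsampling
            layers += [
                nn.Conv2d(dims[i], dims[i + 1], kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(dims[i + 1]),
                nn.ReLU(inplace=True),
            ]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, 3, 224, 224) -> tokens: (batch, 196, embed_dim)
        return self.stem(x).flatten(2).transpose(1, 2)
```

Because the stack downsamples by 16x overall (a 224x224 input yields a 14x14 grid, i.e. 196 tokens), it is a drop-in replacement for the patchify stem while adding the early local processing that these excerpts associate with more robust ViT training.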