Published: 2022
DOI: 10.1145/3505244
Transformers in Vision: A Survey

Abstract: Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling of long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as Long Short-Term Memory (LSTM). Unlike convolutional networks, Transformers require minimal inductive biases for their design and…
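The mechanism behind these properties is scaled dot-product self-attention, which relates every sequence element to every other in a single matrix product rather than step by step. The NumPy sketch below is a minimal illustration only; the function and variable names are assumptions, not code from the survey.

```python
# Minimal scaled dot-product self-attention (illustrative sketch; shapes and
# weight names are assumptions, not code from the survey).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project the whole sequence at once
    scores = q @ k.T / np.sqrt(k.shape[-1])        # all pairwise interactions: (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax over keys
    return weights @ v                             # every output depends on every input element

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                       # sequence of 8 tokens, d_model = 16
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)             # (8, 16), computed in one parallel pass, no recurrence
```

Because the attention weights are computed for all positions simultaneously, there is no sequential dependency between time steps, which is the parallelism advantage over LSTMs that the abstract highlights.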

Cited by 1,566 publications (622 citation statements)
References 82 publications
“…At this stage, we can conclude that using ResNet as a deep learning backbone does not outperform humans in feature selection. This, however, can possibly be improved by using more sophisticated network architectures and pipelines such as transformers 48 .…”
Section: Discussion (citation type: mentioning)
Confidence: 99%
“…Chaudhari et al. [141] provided a survey of attention models in deep neural networks that concentrates on their application to natural language processing, while our work focuses on computer vision. Three more specific surveys [142][143][144] summarize the development of visual transformers, while our paper reviews attention mechanisms in vision more generally, not just self-attention mechanisms. Wang and Tax [145] presented a survey of attention models in computer vision, but it only considers RNN-based attention models, which form just a part of our survey.…”
Section: Other Surveys (citation type: mentioning)
Confidence: 99%
“…A detailed survey of vision transformers is omitted here, as other recent surveys [142][143][144], [171] comprehensively review the use of transformer methods for visual tasks.…”
Section: Vision Transformers (citation type: mentioning)
Confidence: 99%
“…This architecture was proposed in [18] for NLP tasks, outperforming the recurrent and convolutional models that were state of the art at the time. Recently, several works have been published looking for ways to apply it to computer vision tasks [19]. This architecture relies on a self-attention mechanism, but, instead of using recurrence, transformer models create relationships between all samples in the input sequence, allowing parallelization and better use of modern devices such as TPUs and GPUs.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
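To make the last point concrete, the sketch below shows one common recipe for feeding images to a transformer: cut the image into non-overlapping patches, flatten each patch, and linearly project it to a token embedding (ViT-style). This is an illustrative assumption for exposition, with hypothetical names and shapes, not the specific pipeline of [19] or of the surveyed paper.

```python
# ViT-style patch embedding sketch (hypothetical names and shapes; not the
# specific method of [19] or of this survey).
import numpy as np

def image_to_tokens(img, patch, w_embed):
    """img: (H, W, C) -> (num_patches, d_model) sequence of patch tokens."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    # split into non-overlapping patch x patch blocks, then flatten each block
    tokens = (img.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, patch * patch * c))
    return tokens @ w_embed                        # linear projection to the model width

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))                 # a toy "image"
w_embed = rng.normal(size=(4 * 4 * 3, 64))         # 4x4 patches -> 64-dim tokens
seq = image_to_tokens(img, patch=4, w_embed=w_embed)  # (64, 64): 64 patch tokens
# With positional embeddings added, this token sequence feeds a standard
# transformer encoder, whose self-attention relates every patch to every other.
```

Once the image is a token sequence, the encoder treats patches exactly as it would words, which is what lets the self-attention mechanism described in the excerpt relate all input samples in parallel.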