2021
DOI: 10.48550/arxiv.2103.14030
Preprint

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted win…

Cited by 853 publications (1,758 citation statements)
References 52 publications
“…Many ViT variants were proposed in recent months. Swin Transformer [21] applied the shifted window approach to compute the self-attention matrix. Wang et al. proposed the PVT-based models (PVTv1 & v2) [34,35], which built a progressive shrinking pyramid and a spatial-reduction attention layer to generate multi-resolution feature maps.…”
Section: Related Work
confidence: 99%
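The shifted-window scheme quoted above can be sketched in a few lines: the feature map is cyclically shifted by half a window, then split into non-overlapping windows inside which self-attention would run. This is a minimal NumPy illustration with assumed shapes and parameter names, not the authors' implementation.

```python
import numpy as np

def shifted_window_partition(x, window_size=4):
    """Cyclically shift a (H, W, C) feature map and split it into
    non-overlapping windows of window_size x window_size tokens.
    (Hypothetical sketch; parameter names are assumptions.)"""
    H, W, C = x.shape
    shift = window_size // 2
    # Cyclic shift so the new windows straddle the old window borders,
    # letting attention mix information across previous partitions.
    x = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    # Partition into (H/ws) * (W/ws) windows of ws*ws tokens each.
    x = x.reshape(H // window_size, window_size,
                  W // window_size, window_size, C)
    windows = x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size ** 2, C)
    return windows

feat = np.random.rand(8, 8, 16)          # toy 8x8 feature map, 16 channels
wins = shifted_window_partition(feat, window_size=4)
print(wins.shape)                        # (4, 16, 16): 4 windows of 16 tokens
```

Because attention is computed only within each window, the cost is linear in image size rather than quadratic; the shift between consecutive layers is what restores cross-window connections.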
“…To alleviate this problem, many works try to introduce convolutions into ViTs [21,35,37,40]. These architectures enjoy the advantages of both paradigms, with attention layers modeling long-range dependencies while convolutions emphasize the local properties of images.…”
Section: Introduction
confidence: 99%
“…Beyond classification, PVT [16] introduces a pyramid structure in the Transformer, demonstrating the potential of a pure transformer backbone compared to CNN counterparts in dense prediction tasks. After that, methods such as Swin [17], CvT [18] and Twins [19] enhance the local continuity of features and remove the fixed-size position embedding to improve the performance of Transformers in dense prediction tasks.…”
Section: Related Work
confidence: 99%
“…To reduce the complexity of this process at large resolutions, the sequence reduction process [17] is used; it applies a reduction ratio R to reduce the length of the sequence as follows:…”
Section: Transformer-based Encoder
confidence: 99%
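The sequence reduction with ratio R described above can be sketched as follows: the key/value sequence of length N = H·W is reshaped so each R×R spatial neighborhood folds into the channel dimension, then projected back to C channels, cutting the sequence length by R². This is a hypothetical PVT-style sketch with assumed names; the random matrix stands in for a learned linear projection.

```python
import numpy as np

def reduce_sequence(x, H, W, R=2):
    """Shorten a token sequence of length N = H*W by a factor of R**2.
    (Hypothetical sketch; names and shapes are assumptions.)"""
    N, C = x.shape                       # N tokens, C channels
    # Fold each R x R spatial neighborhood into the channel dimension.
    x = x.reshape(H // R, R, W // R, R, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape((H // R) * (W // R), R * R * C)
    # Project back to C channels; random weights stand in for a
    # learned linear layer.
    W_proj = np.random.rand(R * R * C, C) / (R * R * C)
    return x @ W_proj

tokens = np.random.rand(64, 32)          # 8x8 feature map flattened, 32 channels
reduced = reduce_sequence(tokens, H=8, W=8, R=2)
print(reduced.shape)                     # (16, 32): sequence length cut by R**2
```

With keys and values shortened this way, attention against the full-length queries costs O(N·N/R²) instead of O(N²), which is what makes the approach practical at high resolutions.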
“…Inspired by the recent success of vision transformer networks (Zhang et al., 2019; Carion et al., 2020; Zeng et al., 2020; Dosovitskiy et al., 2020; Esser et al., 2021; Liu et al., 2021; Hudson & Zitnick, 2021; Touvron et al., 2020), we make a step towards a more practical scenario in which we only assume access to models pre-trained on public computer vision datasets and a relatively small medical dataset, and we use the weights of the pre-trained models to achieve higher accuracy in medical image analysis tasks. These settings are particularly appealing as (1) such models can be easily adopted on typical medical datasets;…”
Section: Introduction
confidence: 99%