2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00986

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Cited by 16,164 publications (7,577 citation statements)
References 31 publications
“…However, the contribution of the MLP to the resulting performance characteristics requires further research. Such applications include, for example, an MLP at the output of a convolutional neural network, or numerous MLPs as components of modern architectures [2][3][4][5][6][7][8][9].…”
Section: Discussion of the Results of Studying the Influence of the C... (mentioning)
confidence: 99%
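The first pattern this quote names, an MLP attached to the output of a convolutional feature extractor, can be sketched as follows. This is a minimal illustration assuming PyTorch; the class name `ConvNetWithMLPHead` and all layer sizes are hypothetical and not taken from the cited work.

```python
# Minimal sketch (assumed PyTorch) of the pattern the quote describes:
# a convolutional feature extractor followed by an MLP classification head.
# Layer sizes are illustrative, not from any cited architecture.
import torch
import torch.nn as nn

class ConvNetWithMLPHead(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Convolutional feature extractor.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # MLP at the output, as described in the quote.
        self.mlp_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp_head(self.features(x))

logits = ConvNetWithMLPHead()(torch.randn(1, 3, 32, 32))  # shape (1, 10)
```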
“…Along with the development of convolutional neural networks, a large number of neural network architectures have appeared that provide the same classification quality as convolutional neural networks but require less computation. These include networks such as MLP-Mixer (multilayer perceptron mixer) [2], Vision Transformer (ViT) [3], Compact Transformers [4], ConvMixer (a Transformer-style model using convolutions for mixing) [5], External Attention Transformer [6], FNet (a Transformer using the Fourier transform) [7], gMLP (gated MLPs: multilayer perceptrons with element-wise multiplication) [8], Swin Transformer (a Transformer with shifted windows) [9], and similar ones. Despite the variety of architectures, all of them have a multilayer perceptron (MLP) at the output and, in addition, inside the architecture.…”
Section: Introduction (mentioning)
confidence: 99%
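The shared internal block this quote points to, the token-wise MLP found inside ViT, MLP-Mixer, Swin Transformer, and the other listed models, is typically two linear layers around a nonlinearity. A minimal sketch assuming PyTorch; the 4x expansion ratio is the common convention, not a claim about any specific cited architecture.

```python
# Token-wise MLP block shared by ViT-family architectures:
# expand the embedding dimension, apply a nonlinearity, project back.
import torch
import torch.nn as nn

class TransformerMLP(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * expansion)  # expand
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * expansion, dim)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); the MLP acts on each token independently.
        return self.fc2(self.act(self.fc1(x)))

tokens = torch.randn(1, 196, 768)             # e.g. 14x14 patches, dim 768
assert TransformerMLP(768)(tokens).shape == tokens.shape
```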
“…CaiT [41] proposes a layer scale to ensure that deeper networks can be trained for better performance, and LV-ViT [23] modifies how the model is trained when CutMix [52] augmentation is applied to ViT. Another parallel thread for improving vision transformers incorporates a CNN-like pyramid structure into ViT [48,43,19,54,27,15,30,7,50,6]. PVT [43] and PiT [19] introduce the pyramid structure of most CNN models into ViT, which makes them more suitable for object detection as it provides multiscale features.…”
Section: Related Work (mentioning)
confidence: 99%
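The pyramid idea mentioned here, halving spatial resolution between stages while growing the channel dimension so that detectors get multiscale features, can be illustrated with a patch-merging step in the style popularized by Swin Transformer. A hedged sketch assuming PyTorch; PVT and PiT realize the downsampling differently, and the sizes below are illustrative.

```python
# Patch merging: each 2x2 neighborhood of patch tokens is concatenated
# along channels, then linearly reduced, so H and W halve while the
# channel dimension doubles -- yielding a pyramid of feature maps.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim)  # 4 patches -> 2x channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H, W, dim) with H and W even.
        b, h, w, c = x.shape
        # Gather each 2x2 neighborhood into the channel dimension.
        x = x.reshape(b, h // 2, 2, w // 2, 2, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, h // 2, w // 2, 4 * c)
        return self.reduction(x)  # (batch, H/2, W/2, 2*dim)

feat = torch.randn(1, 56, 56, 96)
assert PatchMerging(96)(feat).shape == (1, 28, 28, 192)
```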
“…In particular, the recently proposed Vision Transformer (ViT) [13] demonstrates image classification results comparable to the firmly established and prevalent CNNs [18,39,4] in computer vision, albeit relying on a huge amount of training data. It has since led to an explosion of interest [48,43,19,54,27,15,30,7,50,6] in further investigating its potential for a wide variety of vision applications.…”
Section: Introduction (mentioning)
confidence: 99%
“…Other emerging techniques for image classification include vision transformers (Dosovitskiy et al., 2020; Touvron et al., 2020; Liu et al., 2021) and contrastive learning (Wang et al., 2020; Jaiswal et al., 2021). Vision transformer methods are based on the attention mechanism, where an input image is split into small patches and the vision transformer learns to focus on the most important regions for classification.…”
Section: Introduction (mentioning)
confidence: 99%
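The patch-splitting step this quote describes, cutting the image into fixed-size patches and projecting each into a token that attention layers can then weight, is commonly implemented as a single strided convolution. A minimal sketch assuming PyTorch; the patch size and embedding dimension are illustrative defaults, not values from any cited paper.

```python
# Patch embedding: a convolution with stride equal to its kernel size
# performs "split into patches + flatten + linear projection" in one op.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch: int = 16, in_ch: int = 3, dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.proj(img)                   # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
assert tokens.shape == (1, 196, 768)         # 14x14 patches, dim 768
```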