Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Liu, Ze; Lin, Yutong; Cao, Yue; Hu, Han; Wei, Yixuan; Zhang, Zheng; Lin, Stephen; Guo, Baining

doi:10.1109/iccv48922.2021.00986

Cited by 16,164 publications

(7,577 citation statements)

References 31 publications



“…However, the contribution of MLP to the resulting characteristics requires further research. Such applications include, for example, MLP at the output of a convolutional neural network or numerous MLPs as part of modern architectures [2][3][4][5][6][7][8][9].…”

Section: Discussion Of the Results Of Studying The Influence Of The C... (mentioning)

confidence: 99%

“…Along with the development of convolutional neural networks, a large number of neural network architectures have appeared that provide the same classification quality as convolutional neural networks, but require less computation. These are networks such as MLP-Mixer (multilayer perceptron mixer) [2], Vision Transformer (ViT) [3], Compact Transformers [4], ConvMixer (Transformer using convolutions for mixing) [5], External Attention Transformer (Transformer with external attention) [6], FNet (Transformer using Fourier transform) [7], gMLP (MLPs with gating: multilayer perceptrons with element-wise multiplication) [8], Swin Transformer (Transformer with shifted windows) [9] and similar ones. Despite the variety of architectures, all of them have a multilayer perceptron (MLP) at the output, and besides this, inside the architecture.…”

Section: Introduction (mentioning)

confidence: 99%


Determination of the influence of the choice of the pruning procedure parameters on the learning quality of a multilayer perceptron

Galchonkov, Nevrev, Shevchuk, et al. 2022. EEJET


Pruning connections in a fully connected neural network removes redundancy in the network structure and thus reduces the computational complexity of its implementation while maintaining the classification quality for the images at its input. However, the choice of parameters for the pruning procedure has not yet been sufficiently studied. The choice depends essentially on the configuration of the neural network. However, any neural network configuration contains one or more multilayer perceptrons, and for them it is possible to develop universal recommendations for choosing the parameters of the pruning procedure. One of the most promising methods for practical implementation is considered: the iterative pruning method, which uses preprocessing of input signals to regularize the learning process of a neural network. For a specific configuration of a multilayer perceptron and the MNIST (Modified National Institute of Standards and Technology) dataset, a database of handwritten digit samples proposed by the US National Institute of Standards and Technology as a standard for comparing image recognition methods, the dependences of the classification accuracy of handwritten digits and of the learning rate on the learning step, the pruning interval, and the number of links removed at each pruning iteration were obtained. It is shown that the best set of parameters of the learning procedure with pruning improves classification quality by about 1 % compared with the worst set in the studied range. The convex nature of these dependences allows a constructive approach to finding a neural network configuration that provides the highest classification accuracy at the minimum computational cost.
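The abstract names two of the studied parameters, the pruning interval and the number of links removed per pruning iteration. The sketch below shows how those two knobs might interact in an iterative pruning loop. It assumes a magnitude-based criterion (drop the smallest-magnitude weights), which the abstract does not state, and it leaves out the training updates and the input preprocessing entirely. The names `prune_smallest` and `pruning_schedule` are hypothetical.

```python
def prune_smallest(weights, mask, k):
    """Zero out the k smallest-magnitude weights that are still active.

    Assumption: magnitude-based pruning; the abstract does not name the criterion.
    """
    active = [i for i, keep in enumerate(mask) if keep]
    # Sort the active indices by absolute weight, smallest first.
    for i in sorted(active, key=lambda i: abs(weights[i]))[:k]:
        mask[i] = False
        weights[i] = 0.0
    return weights, mask


def pruning_schedule(total_steps, interval):
    """Training steps (1-based) at which a pruning pass runs."""
    return [s for s in range(1, total_steps + 1) if s % interval == 0]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
mask = [True] * len(weights)
for step in pruning_schedule(total_steps=6, interval=3):
    # A real loop would run gradient updates between pruning passes.
    weights, mask = prune_smallest(weights, mask, k=2)
```

With an interval of 3 and 2 links removed per pass, two passes run over 6 steps and 4 of the 6 toy weights are removed.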



“…CaiT [41] proposes a layer-scalar to assure training a deeper network for better performance and LV-ViT [23] modified how the model is trained when CutMix [52] augmentation is applied on ViT. Another parallel thread for improving vision transformer is in incorporating CNN-like pyramid structure into ViT [48,43,19,54,27,15,30,7,50,6]. PVT [43] and PiT [19] introduce the pyramid structure of most CNN models into ViT, which makes them more suitable for object detection as it can provide multiscale features.…”

Section: Related Work (mentioning)

confidence: 99%

“…Especially, the recently proposed Vision Transformers (ViT) [13] demonstrates comparable image classification results against the firmly established and prevalent CNNs [18,39,4] in computer vision, albeit relying on a huge amount of training data. It has since then led to an explosion of interest [48,43,19,54,27,15,30,7,50,6] in further investigating its potential for a wide variety of vision applications.…”

Section: Introduction (mentioning)

confidence: 99%

RegionViT: Regional-to-Local Attention for Vision Transformers

Chen, Panda, Fan. 2021. Preprint


Vision transformer (ViT) has recently shown its strong capability in achieving comparable results to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits the same architecture from natural language processing directly, which is often not optimized for vision applications. Motivated by this, in this paper, we propose a new architecture that adopts the pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on the spatial location. The regional-to-local attention includes two steps: first, the regional self-attention extracts global information among all regional tokens, and then the local self-attention exchanges information between one regional token and its associated local tokens. Therefore, even though local self-attention confines its scope to a local region, it can still receive global information. Extensive experiments on three vision tasks, including image classification, object detection and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants including many concurrent works. Our source codes and models will be publicly available.
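The two-step attention described in this abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes single-head attention with identity query/key/value projections, and it omits everything else a real block would contain (learned projections, normalization, feed-forward layers, residual connections). The names `self_attention` and `regional_to_local` are hypothetical.

```python
import math


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[d] for e, v in zip(exps, tokens))
                    for d in range(dim)])
    return out


def regional_to_local(regional, local_groups):
    """Step 1: attention among all regional tokens.
    Step 2: attention within each group of one regional token plus its local tokens."""
    regional = self_attention(regional)
    new_regional, new_local = [], []
    for r, group in zip(regional, local_groups):
        mixed = self_attention([r] + group)
        new_regional.append(mixed[0])
        new_local.append(mixed[1:])
    return new_regional, new_local


regional = [[1.0, 0.0], [0.0, 1.0]]
local_groups = [[[1.0, 1.0], [0.5, 0.5]], [[0.0, 0.0]]]
new_regional, new_local = regional_to_local(regional, local_groups)
```

In this toy input, the only local token of the second region starts at zero, yet its first component becomes positive after the block. That component can only have come from the first regional token via step 1, which is the global-information path the abstract describes.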


“…Other emerging techniques for image classification include vision transformer (Dosovitskiy et al., 2020; Touvron et al., 2020; Liu et al., 2021) and contrastive learning (Wang et al., 2020; Jaiswal et al., 2021). Vision transformer methods are based on the attention mechanism, where an input image is split into small patches and the vision transformer can learn to focus on the most important regions for classification.…”

Section: Introduction (mentioning)

confidence: 99%

Classification of Diabetic Foot Ulcers Using Class Knowledge Banks

Han, Zhou, et al. 2022. Front. Bioeng. Biotechnol.


Diabetic foot ulcers (DFUs) are one of the most common complications of diabetes. Identifying the presence of infection and ischaemia in DFU is important for ulcer examination and treatment planning. Recently, the computerized classification of infection and ischaemia of DFU based on deep learning methods has shown promising performance. Most state-of-the-art DFU image classification methods employ deep neural networks, especially convolutional neural networks, to extract discriminative features, and predict class probabilities from the extracted features by fully connected neural networks. At test time, the prediction depends on an individual input image and the trained parameters, so knowledge in the training data is not explicitly utilized. To better utilize the knowledge in the training data, we propose class knowledge banks (CKBs) consisting of trainable units that can effectively extract and represent class knowledge. Each unit in a CKB is used to compute similarity with a representation extracted from an input image. The averaged similarity between the units in the CKB and the representation can be regarded as the logit of the considered input. In this way, the prediction depends not only on input images and trained parameters in networks but also on the class knowledge extracted from the training data and stored in the CKBs. Experimental results show that the proposed method can effectively improve the performance of DFU infection and ischaemia classifications.
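The logit computation described in this abstract (average the similarity between an image representation and every unit of a class's bank) can be sketched as follows. The abstract does not say which similarity measure is used, so cosine similarity is an assumption here, and the unit values are toy numbers rather than trained parameters. The names `cosine` and `ckb_logits` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ckb_logits(representation, banks):
    """One logit per class: mean similarity between the representation
    and the units stored in that class's knowledge bank."""
    return [sum(cosine(representation, unit) for unit in bank) / len(bank)
            for bank in banks]


# Two classes, each with a bank of two toy units (these would be trainable in practice).
banks = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 2.0]],
]
logits = ckb_logits([3.0, 0.0], banks)
```

Here the representation points in the same direction as the first bank's units and is orthogonal to the second bank's units, so the first class receives the larger logit.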

