2021
DOI: 10.48550/arxiv.2103.14030
Preprint

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted win…

Cited by 853 publications (1,758 citation statements)
References 52 publications
“…Many ViT variants were proposed in recent months. Swin Transformer [21] applied the shifted window approach to compute the self-attention matrix. Wang et al. proposed the PVT-based models (PVTv1 & v2) [34,35], which built a progressive shrinking pyramid and a spatial-reduction attention layer to generate multi-resolution feature maps.…”
Section: Related Work
confidence: 99%
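The shifted-window scheme quoted above can be sketched in a few lines: the feature map is cyclically shifted by half a window, then split into non-overlapping windows inside which self-attention would run. This is a minimal NumPy illustration with assumed shapes and parameter names, not the authors' implementation.

```python
import numpy as np

def shifted_window_partition(x, window_size=4):
    """Cyclically shift a (H, W, C) feature map and split it into
    non-overlapping windows of window_size x window_size tokens.
    (Hypothetical sketch; parameter names are assumptions.)"""
    H, W, C = x.shape
    shift = window_size // 2
    # Cyclic shift so the new windows straddle the old window borders,
    # letting attention mix information across previous partitions.
    x = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    # Partition into (H/ws) * (W/ws) windows of ws*ws tokens each.
    x = x.reshape(H // window_size, window_size,
                  W // window_size, window_size, C)
    windows = x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size ** 2, C)
    return windows

feat = np.random.rand(8, 8, 16)          # toy 8x8 feature map, 16 channels
wins = shifted_window_partition(feat, window_size=4)
print(wins.shape)                        # (4, 16, 16): 4 windows of 16 tokens
```

Because attention is computed only within each window, the cost is linear in image size rather than quadratic; the shift between consecutive layers is what restores cross-window connections.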
“…To alleviate this problem, many works try to introduce convolutions into ViTs [21,35,37,40]. These architectures enjoy the advantages of both paradigms, with attention layers modeling long-range dependencies while convolutions emphasize the local properties of images.…”
Section: Introduction
confidence: 99%
“…Beyond classification, PVT [16] introduces a pyramid structure in the Transformer, demonstrating the potential of a pure transformer backbone compared to CNN counterparts in dense prediction tasks. After that, methods such as Swin [17], CvT [18] and Twins [19] enhance the local continuity of features and remove the fixed-size position embedding to improve the performance of Transformers in dense prediction tasks.…”
Section: Related Work
confidence: 99%
“…To reduce the complexity of this process at large resolutions, the sequence reduction process [17] is used; it applies a reduction ratio R to reduce the length of the sequence as follows:…”
Section: Transformer-based Encoder
confidence: 99%
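The sequence reduction with ratio R described above can be sketched as follows: the key/value sequence of length N = H·W is reshaped so each R×R spatial neighborhood folds into the channel dimension, then projected back to C channels, cutting the sequence length by R². This is a hypothetical PVT-style sketch with assumed names; the random matrix stands in for a learned linear projection.

```python
import numpy as np

def reduce_sequence(x, H, W, R=2):
    """Shorten a token sequence of length N = H*W by a factor of R**2.
    (Hypothetical sketch; names and shapes are assumptions.)"""
    N, C = x.shape                       # N tokens, C channels
    # Fold each R x R spatial neighborhood into the channel dimension.
    x = x.reshape(H // R, R, W // R, R, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape((H // R) * (W // R), R * R * C)
    # Project back to C channels; random weights stand in for a
    # learned linear layer.
    W_proj = np.random.rand(R * R * C, C) / (R * R * C)
    return x @ W_proj

tokens = np.random.rand(64, 32)          # 8x8 feature map flattened, 32 channels
reduced = reduce_sequence(tokens, H=8, W=8, R=2)
print(reduced.shape)                     # (16, 32): sequence length cut by R**2
```

With keys and values shortened this way, attention against the full-length queries costs O(N·N/R²) instead of O(N²), which is what makes the approach practical at high resolutions.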
“…Inspired by the recent success of vision transformer networks (Zhang et al., 2019; Carion et al., 2020; Zeng et al., 2020; Dosovitskiy et al., 2020; Esser et al., 2021; Liu et al., 2021; Hudson & Zitnick, 2021; Touvron et al., 2020), we make a step towards a more practical scenario in which we only assume access to models pre-trained on public computer vision datasets and a relatively small medical dataset, and we use the weights of the pre-trained models to achieve higher accuracy in medical image analysis tasks. These settings are particularly appealing as (1) such models can be easily adopted on typical medical datasets;…”
Section: Introduction
confidence: 99%