2021
DOI: 10.48550/arxiv.2111.14791
Preprint

Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis

Abstract: Vision Transformers (ViTs) have shown great performance in self-supervised learning of global and local representations that can be transferred to downstream applications. Inspired by these results, we introduce a novel self-supervised learning framework with tailored proxy tasks for medical image analysis. Specifically, we propose: (i) a new 3D transformer-based model, dubbed Swin UNEt TRansformers (Swin UNETR), with a hierarchical encoder for self-supervised pre-training; (ii) tailored proxy tasks for learni…
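For orientation only, here is a minimal PyTorch sketch of the pre-training setup the abstract describes: a hierarchical 3D encoder feeding several proxy-task heads. The encoder below is a small convolutional stand-in, not the actual Swin encoder, and the three heads (masked-volume reconstruction, rotation prediction, contrastive embedding) and all dimensions are assumptions made for illustration, not code from the paper.

# Minimal sketch of multi-task self-supervised pre-training head wiring.
# The real Swin UNETR encoder is replaced by a small 3D CNN stand-in; the
# three proxy-task heads and their shapes are illustrative assumptions.
import torch
import torch.nn as nn

class TinyEncoder3D(nn.Module):
    """Placeholder for the hierarchical encoder (downsamples 16x overall)."""
    def __init__(self, in_ch=1, dim=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, dim // 4, 4, stride=4), nn.GELU(),
            nn.Conv3d(dim // 4, dim, 4, stride=4), nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)                        # (B, dim, D/16, H/16, W/16)

class SSLHeads(nn.Module):
    """Proxy-task heads: volume reconstruction, rotation class, contrastive embedding."""
    def __init__(self, dim=96, n_rot=4, proj_dim=128):
        super().__init__()
        self.recon = nn.Sequential(               # upsample back to voxel space
            nn.ConvTranspose3d(dim, dim // 4, 4, stride=4), nn.GELU(),
            nn.ConvTranspose3d(dim // 4, 1, 4, stride=4),
        )
        self.rot = nn.Linear(dim, n_rot)          # predict the applied rotation
        self.contrast = nn.Linear(dim, proj_dim)  # embedding for a contrastive loss
    def forward(self, feat):
        pooled = feat.mean(dim=(2, 3, 4))         # global average pool over voxels
        return self.recon(feat), self.rot(pooled), self.contrast(pooled)

encoder, heads = TinyEncoder3D(), SSLHeads()
x = torch.randn(2, 1, 96, 96, 96)                 # two augmented CT sub-volumes
rec, rot_logits, emb = heads(encoder(x))
print(rec.shape, rot_logits.shape, emb.shape)     # (2,1,96,96,96) (2,4) (2,128)

The three head outputs would each feed their own loss term during pre-training; only the encoder weights are kept for downstream transfer.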

Cited by 19 publications (26 citation statements); references 34 publications.
“…In another work, Tang et al. [163] introduce Swin UNETR, a novel self-supervised learning framework with tailored proxy tasks used to pre-train a Transformer encoder on a dataset of 5,050 CT images. They validate the effectiveness of the pre-training by fine-tuning the Transformer encoder, paired with a CNN-based decoder, on the downstream MSD and BTCV segmentation tasks.…”
Section: Hybrid Architectures (mentioning, confidence: 99%)
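The statement above describes a pre-train-then-fine-tune recipe. The sketch below (plain PyTorch, with hypothetical module and file names) shows the general shape of the second step: load the self-supervised weights into the encoder, attach a CNN-style decoder, and train end-to-end with a voxel-wise segmentation loss. The 14-class output is an assumption matching the usual BTCV label set, not something stated on this page.

# Hedged sketch of fine-tuning: reuse self-supervised encoder weights and
# train an encoder + CNN-decoder model for multi-organ segmentation.
import torch
import torch.nn as nn

class Encoder3D(nn.Module):                        # stand-in for the pre-trained encoder
    def __init__(self, in_ch=1, dim=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, dim // 4, 4, stride=4), nn.GELU(),
            nn.Conv3d(dim // 4, dim, 4, stride=4), nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class CNNDecoder3D(nn.Module):
    """Simple convolutional decoder producing per-voxel class logits."""
    def __init__(self, dim=96, n_classes=14):      # 14 labels incl. background (assumption)
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose3d(dim, dim // 4, 4, stride=4), nn.GELU(),
            nn.ConvTranspose3d(dim // 4, n_classes, 4, stride=4),
        )
    def forward(self, feat):
        return self.up(feat)

encoder, decoder = Encoder3D(), CNNDecoder3D()
# Load self-supervised weights into the encoder only (hypothetical file name):
# encoder.load_state_dict(torch.load("ssl_pretrained_encoder.pt"))
model = nn.Sequential(encoder, decoder)
logits = model(torch.randn(1, 1, 96, 96, 96))      # (1, 14, 96, 96, 96) voxel-wise logits
target = torch.zeros(1, 96, 96, 96, dtype=torch.long)
loss = nn.CrossEntropyLoss()(logits, target)
loss.backward()                                    # fine-tune encoder and decoder jointly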
“…TransUNet [96], CoTr* [112] (a variant of CoTr with a smaller CNN encoder), CoTr [112], UNETR [35], and Swin UNETR [163]. Note: Avg: average results (over 12 organs), AG: left and right adrenal glands, Pan: pancreas, Sto: stomach, Spl: spleen, Liv: liver, Gall: gallbladder.…”
Section: Multi-scale Architectures (mentioning, confidence: 99%)
“…Specifically, the self-attention module in ViT-based models captures long-range information through pairwise interactions between token embeddings, leading to more effective local and global contextual representations [34]. In addition, ViTs have been successful at learning pretext tasks for self-supervised pre-training in various applications [9,8,36]. In medical image analysis, UNETR [16] is the first methodology that uses a ViT as its encoder without relying on a CNN-based feature extractor.…”
Section: Introduction (mentioning, confidence: 99%)
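To make the "pairwise interaction between token embeddings" concrete, here is a minimal single-head scaled dot-product self-attention in PyTorch. The token count and embedding size are arbitrary illustration values; real ViT and Swin blocks add multiple heads, positional information, and (for Swin) windowing.

# Minimal scaled dot-product self-attention over token embeddings.
import torch
import torch.nn.functional as F

tokens = torch.randn(1, 16, 32)                  # (batch, num_tokens, embed_dim)
Wq = torch.nn.Linear(32, 32, bias=False)
Wk = torch.nn.Linear(32, 32, bias=False)
Wv = torch.nn.Linear(32, 32, bias=False)

q, k, v = Wq(tokens), Wk(tokens), Wv(tokens)
scores = q @ k.transpose(-2, -1) / (32 ** 0.5)   # (1, 16, 16): every token scores every token
weights = F.softmax(scores, dim=-1)              # pairwise interaction weights
out = weights @ v                                # each output token mixes information globally
print(weights.shape, out.shape)                  # torch.Size([1, 16, 16]) torch.Size([1, 16, 32])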
“…Hierarchical models [5,20,24] have received significant interest in medical image analysis due to their advantages in modeling heterogeneous, high-resolution radiography images. Recent works on vision transformers [8,18] show superior performance in learning visual representations compared to state-of-the-art convolution-based networks [12].…”
Section: Introduction (mentioning, confidence: 99%)
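As a small illustration of what "hierarchical" means in this context, the sketch below builds a three-stage feature pyramid in which each stage halves the spatial resolution. Convolutions are used purely as stand-ins here; Swin-style encoders produce the same kind of pyramid with patch merging and windowed self-attention.

# Illustrative feature pyramid: progressively coarser 3D feature maps.
import torch
import torch.nn as nn

stages = nn.ModuleList([
    nn.Conv3d(1, 24, kernel_size=2, stride=2),    # stage 1: 1/2 resolution
    nn.Conv3d(24, 48, kernel_size=2, stride=2),   # stage 2: 1/4 resolution
    nn.Conv3d(48, 96, kernel_size=2, stride=2),   # stage 3: 1/8 resolution
])

x = torch.randn(1, 1, 64, 64, 64)
pyramid = []
for stage in stages:                              # each stage halves the spatial size
    x = stage(x)
    pyramid.append(x)
print([tuple(f.shape) for f in pyramid])
# [(1, 24, 32, 32, 32), (1, 48, 16, 16, 16), (1, 96, 8, 8, 8)]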