2021
DOI: 10.48550/arxiv.2104.01136
Preprint

LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Cited by 62 publications (91 citation statements)
References 32 publications
“…TransClaw-UNet achieves an absolute gain of 0.6 in Dice score over Claw-UNet on the Synapse multi-organ segmentation dataset and shows excellent generalization. Similarly, inspired by LeViT [167], Xu et al [168] propose LeViT-UNet, which aims to optimize the trade-off between accuracy and efficiency. LeViT-UNet is a multistage architecture that demonstrates good performance and generalization ability on the Synapse and ACDC benchmarks.…”
Section: Hybrid Architectures (mentioning)
confidence: 99%
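As a rough illustration of the accuracy/efficiency trade-off LeViT is credited with in the statement above, the sketch below (not the authors' code; channel sizes, depths, and module names are illustrative assumptions) shows the hybrid pattern of a strided convolutional stem that shrinks the image to a short token sequence before any attention stage is applied.

```python
# Hedged sketch of a LeViT-style hybrid: conv stem downsamples aggressively,
# attention then runs on the much smaller token grid. Sizes are assumptions.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Four strided convolutions: 224x224 image -> 14x14 feature grid."""
    def __init__(self, out_channels=256):
        super().__init__()
        chans = [3, 32, 64, 128, out_channels]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.Hardswish()]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                      # (B, 3, 224, 224)
        x = self.stem(x)                       # (B, C, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, C) token sequence

class AttentionStage(nn.Module):
    """A few standard transformer encoder blocks over the token sequence."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)

    def forward(self, tokens):
        return self.blocks(tokens)

stem, stage = ConvStem(), AttentionStage()
tokens = stage(stem(torch.randn(1, 3, 224, 224)))   # (1, 196, 256)
```

Because attention only ever sees 196 tokens instead of thousands of pixels, the expensive quadratic part of the model stays cheap, which is the trade-off the quoted statement refers to.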
“…DeiT [38] improves the data efficiency of training ViT with a token distillation pipeline. Beyond the plain sequence-to-sequence structure, the efficiency of PVT [39] and the Swin Transformer [30] has sparked much interest in exploring Hierarchical Vision Transformers (HVT) [14,22,41,44]. ViT has also been extended to low-level tasks and dense prediction problems [2,6,20].…”
Section: Vision Transformer (mentioning)
confidence: 99%
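For the token-distillation pipeline attributed to DeiT in the quote above, here is a hedged sketch of the idea, assuming a toy encoder and hard-label distillation; the class names, dimensions, and loss weighting are illustrative, not the paper's exact recipe.

```python
# Hedged illustration of token distillation: a learnable distillation token is
# appended to the patch tokens, and its output is supervised by a teacher's
# predictions while the class token keeps the ground-truth loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDistilledViT(nn.Module):
    """Toy encoder with class + distillation tokens (dimensions are assumptions)."""
    def __init__(self, dim=192, num_patches=196, num_classes=1000, depth=2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=3,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_head = nn.Linear(dim, num_classes)
        self.dist_head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):                       # (B, 196, dim)
        b = patch_tokens.size(0)
        x = torch.cat([self.cls_token.expand(b, -1, -1),
                       self.dist_token.expand(b, -1, -1),
                       patch_tokens], dim=1) + self.pos_embed
        x = self.encoder(x)
        # Class head reads token 0, distillation head reads token 1.
        return self.cls_head(x[:, 0]), self.dist_head(x[:, 1])

def distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    # Ground-truth loss on the class token plus a hard-label teacher loss
    # on the distillation token; equal weighting is one possible choice.
    return 0.5 * F.cross_entropy(cls_logits, labels) + \
           0.5 * F.cross_entropy(dist_logits, teacher_logits.argmax(-1))
```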
“…Compared to traditional CNN structures that operate on a fixed-sized window with restricted spatial interactions (Raghu et al, 2021), ViT allows all the positions in an image to interact through transformer blocks. Since then, many variants have been proposed (Graham et al, 2021; Liu et al, 2021c; Yuan et al, 2021a; Wang et al, 2021b; Han et al, 2021; Wu et al, 2021; Chen et al, 2021b; Steiner et al, 2021; El-Nouby et al, 2021; Liu et al, 2021a; Wang et al, 2021a; Bao et al, 2021). For example, DeiT, T2T-ViT (Yuan et al, 2021b) and Mixer (Chen et al, 2021d) tackle the data-inefficiency problem and make ViT trainable only with ImageNet-1K (Deng et al, 2009).…”
Section: Vision Transformers (mentioning)
confidence: 99%
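To make the quoted contrast with fixed-window convolutions concrete, the snippet below (patch size and dimensions assumed for illustration) shows that a single self-attention layer over patch embeddings already produces a full 196 x 196 interaction pattern, i.e. every position attends to every other position in one step.

```python
# Minimal sketch of global interaction in ViT-style models: patch embedding
# followed by one self-attention layer; the attention weights couple all
# positions, unlike a convolution's fixed local window.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 128, kernel_size=16, stride=16)   # 16x16 patches
attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)          # (1, 196, 128)
out, weights = attn(tokens, tokens, tokens)                   # weights: (1, 196, 196)
# Every one of the 196 patch tokens gets a positive attention weight for all
# 196 positions, so interactions are global after a single layer.
```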