2021
DOI: 10.48550/arxiv.2104.01136
Preprint

LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Cited by 62 publications (91 citation statements)
References 32 publications
“…TransClaw-UNet achieves an absolute gain of 0.6 in Dice score over Claw-UNet on the Synapse multi-organ segmentation dataset and shows excellent generalization. Similarly, inspired by LeViT [167], Xu et al [168] propose LeViT-UNet, which aims to optimize the trade-off between accuracy and efficiency. LeViT-UNet is a multistage architecture that demonstrates good performance and generalization ability on the Synapse and ACDC benchmarks.…”
Section: Hybrid Architectures (mentioning)
confidence: 99%
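As a rough illustration of the accuracy/efficiency trade-off LeViT is credited with in the statement above, the sketch below (not the authors' code; channel sizes, depths, and module names are illustrative assumptions) shows the hybrid pattern of a strided convolutional stem that shrinks the image to a short token sequence before any attention stage is applied.

```python
# Hedged sketch of a LeViT-style hybrid: conv stem downsamples aggressively,
# attention then runs on the much smaller token grid. Sizes are assumptions.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Four strided convolutions: 224x224 image -> 14x14 feature grid."""
    def __init__(self, out_channels=256):
        super().__init__()
        chans = [3, 32, 64, 128, out_channels]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.Hardswish()]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                      # (B, 3, 224, 224)
        x = self.stem(x)                       # (B, C, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, C) token sequence

class AttentionStage(nn.Module):
    """A few standard transformer encoder blocks over the token sequence."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)

    def forward(self, tokens):
        return self.blocks(tokens)

stem, stage = ConvStem(), AttentionStage()
tokens = stage(stem(torch.randn(1, 3, 224, 224)))   # (1, 196, 256)
```

Because attention only ever sees 196 tokens instead of thousands of pixels, the expensive quadratic part of the model stays cheap, which is the trade-off the quoted statement refers to.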
“…DeiT [38] improves the data efficiency of training ViT with a token distillation pipeline. Beyond the plain sequence-to-sequence structure, the efficiency of PVT [39] and the Swin Transformer [30] has sparked much interest in exploring Hierarchical Vision Transformers (HVT) [14,22,41,44]. ViT has also been extended to low-level tasks and dense prediction problems [2,6,20].…”
Section: Vision Transformer (mentioning)
confidence: 99%
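For the token-distillation pipeline attributed to DeiT in the quote above, here is a hedged sketch of the idea, assuming a toy encoder and hard-label distillation; the class names, dimensions, and loss weighting are illustrative, not the paper's exact recipe.

```python
# Hedged illustration of token distillation: a learnable distillation token is
# appended to the patch tokens, and its output is supervised by a teacher's
# predictions while the class token keeps the ground-truth loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDistilledViT(nn.Module):
    """Toy encoder with class + distillation tokens (dimensions are assumptions)."""
    def __init__(self, dim=192, num_patches=196, num_classes=1000, depth=2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=3,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_head = nn.Linear(dim, num_classes)
        self.dist_head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):                       # (B, 196, dim)
        b = patch_tokens.size(0)
        x = torch.cat([self.cls_token.expand(b, -1, -1),
                       self.dist_token.expand(b, -1, -1),
                       patch_tokens], dim=1) + self.pos_embed
        x = self.encoder(x)
        # Class head reads token 0, distillation head reads token 1.
        return self.cls_head(x[:, 0]), self.dist_head(x[:, 1])

def distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    # Ground-truth loss on the class token plus a hard-label teacher loss
    # on the distillation token; equal weighting is one possible choice.
    return 0.5 * F.cross_entropy(cls_logits, labels) + \
           0.5 * F.cross_entropy(dist_logits, teacher_logits.argmax(-1))
```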
“…Compared to traditional CNN structures that operate on a fixed-sized window with restricted spatial interactions (Raghu et al, 2021), ViT allows all the positions in an image to interact through transformer blocks. Since then, many variants have been proposed (Graham et al, 2021; Liu et al, 2021c; Yuan et al, 2021a; Wang et al, 2021b; Han et al, 2021; Wu et al, 2021; Chen et al, 2021b; Steiner et al, 2021; El-Nouby et al, 2021; Liu et al, 2021a; Wang et al, 2021a; Bao et al, 2021). For example, DeiT, T2T-ViT (Yuan et al, 2021b) and Mixer (Chen et al, 2021d) tackle the data-inefficiency problem and make ViT trainable only with ImageNet-1K (Deng et al, 2009).…”
Section: Vision Transformers (mentioning)
confidence: 99%
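To make the quoted contrast with fixed-window convolutions concrete, the snippet below (patch size and dimensions assumed for illustration) shows that a single self-attention layer over patch embeddings already produces a full 196 x 196 interaction pattern, i.e. every position attends to every other position in one step.

```python
# Minimal sketch of global interaction in ViT-style models: patch embedding
# followed by one self-attention layer; the attention weights couple all
# positions, unlike a convolution's fixed local window.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 128, kernel_size=16, stride=16)   # 16x16 patches
attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)          # (1, 196, 128)
out, weights = attn(tokens, tokens, tokens)                   # weights: (1, 196, 196)
# Every one of the 196 patch tokens gets a positive attention weight for all
# 196 positions, so interactions are global after a single layer.
```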