A-ViT: Adaptive Tokens for Efficient Vision Transformer

Yin, Hongxu; Vahdat, Arash; Alvarez, Jose M.; Mallya, Arun; Kautz, Jan; Molchanov, Pavlo

doi:10.48550/arxiv.2112.07658

Cited by 2 publications

(2 citation statements)

References 21 publications

(33 reference statements)

“…One of the limitations of our algorithm is that it requires the batch-wise masking scheme (as in Section 3.5) to achieve the best efficiency. Although this limitation only has little impact on the MIM pre-training, it restrains the application of our method on a broader range of settings, e.g., training ViTs with token sparsification [53,68] that requires instance-wise sparsification. These applications are beyond the scope of this work and we will leave them for the future study.…”

Section: Discussion (mentioning)

confidence: 99%

Green Hierarchical Vision Transformer for Masked Image Modeling

Huang, You, Zheng et al. 2022 (preprint)

We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer [43], allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. As a result, MIM now can work on hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs about 2.7× faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks. Code and pre-trained models: https://github.com/LayneH/GreenMIM.
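The grouping idea in this abstract can be illustrated with a small sketch. It is not the authors' method: the abstract states that the grouping is optimized with dynamic programming, whereas the code below swaps in a simple greedy first-fit-decreasing packing to keep the example short. The function names `group_windows` and `group_attention_mask`, and the fixed `group_size` parameter, are hypothetical and chosen only for illustration. The sketch packs windows (each with some number of visible patches) into groups of bounded size, then builds the attention mask that lets a token attend only to tokens from its own window.

```python
def group_windows(visible_counts, group_size):
    """Pack windows into groups holding at most `group_size` visible patches.

    Greedy first-fit-decreasing heuristic (an assumption for illustration;
    the abstract describes a dynamic-programming formulation instead).
    Returns a list of groups, each a list of window indices.
    """
    # Visit windows from most to fewest visible patches (stable for ties).
    order = sorted(range(len(visible_counts)), key=lambda i: -visible_counts[i])
    groups, loads = [], []
    for w in order:
        n = visible_counts[w]
        if n > group_size:
            raise ValueError("window %d does not fit in one group" % w)
        for g, load in enumerate(loads):
            if load + n <= group_size:
                groups[g].append(w)
                loads[g] += n
                break
        else:
            # No existing group has room, so open a new one.
            groups.append([w])
            loads.append(n)
    return groups


def group_attention_mask(group, visible_counts):
    """Boolean mask for one group: token i may attend to token j only if
    both tokens come from the same window."""
    owner = [w for w in group for _ in range(visible_counts[w])]
    return [[a == b for b in owner] for a in owner]


# Four windows with 3, 1, 2 and 2 visible patches, groups of at most 4.
counts = [3, 1, 2, 2]
groups = group_windows(counts, group_size=4)
mask = group_attention_mask(groups[0], counts)
```

With these inputs the windows are packed into two full groups, so attention runs over two blocks of four tokens rather than over four separately padded windows, while the mask keeps tokens from different windows from attending to each other.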

“…Rao et al. [26] introduced a prediction module to score each patch and then pruned redundant patches hierarchically. Yin et al. [27] reduced the inference cost by automatically minimizing the number of tokens. Despite the great results achieved by these approaches, they only focused on the classification/recognition tasks and reduced the computational complexity at the cost of minor performance degradation.…”

Section: Introduction (mentioning)

confidence: 99%

The Lighter The Better: Rethinking Transformers in Medical Image Segmentation Through Adaptive Pruning

Lin, Liu, Cheng et al. 2022 (preprint)

Vision transformers have recently set off a new wave in the field of medical image analysis due to their remarkable performance on various computer vision tasks. However, recent hybrid-/transformer-based approaches mainly focus on the benefits of transformers in capturing long-range dependency while ignoring the issues of their daunting computational complexity, high training costs, and redundant dependency. In this paper, we propose to employ adaptive pruning to transformers for medical image segmentation and propose a lightweight and effective hybrid network APFormer. To our best knowledge, this is the first work on transformer pruning for medical image analysis tasks. The key features of APFormer mainly are self-supervised self-attention (SSA) to improve the convergence of dependency establishment, Gaussian-prior relative position embedding (GRPE) to foster the learning of position information, and adaptive pruning to eliminate redundant computations and perception information. Specifically, SSA and GRPE consider the well-converged dependency distribution and the Gaussian heatmap distribution separately as the prior knowledge of self-attention and position embedding to ease the training of transformers and lay a solid foundation for the following pruning operation. Then, adaptive transformer pruning, both query-wise and dependency-wise, is performed by adjusting the gate control parameters for both complexity reduction and performance improvement. Extensive experiments on two widely-used datasets demonstrate the prominent segmentation performance of APFormer against the state-of-the-art methods with much fewer parameters and lower GFLOPs. More importantly, we prove, through ablation studies, that adaptive pruning can work as a plug-n-play module for performance improvement on other hybrid-/transformer-based methods. Code is available at https://github.com/xianlin7/APFormer.
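The citation statement above mentions an earlier line of work that scores each patch and then prunes redundant patches hierarchically. A minimal sketch of that general pattern follows, assuming a fixed keep ratio per stage and a caller-supplied scoring function. The names `prune_tokens`, `hierarchical_prune`, `score_fn` and `keep_ratio` are hypothetical; this is neither APFormer's gate-based pruning nor the exact procedure of any cited paper.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]


def hierarchical_prune(tokens, score_fn, keep_ratio, num_stages):
    """Re-score and prune the surviving tokens once per stage."""
    for _ in range(num_stages):
        scores = score_fn(tokens)
        tokens = prune_tokens(tokens, scores, keep_ratio)
    return tokens


# Toy example: tokens are patch ids, and the score depends only on the id.
# In a real model the scores would come from a learned prediction module.
score_by_id = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
kept = hierarchical_prune(
    tokens=list(range(8)),
    score_fn=lambda toks: [score_by_id[t] for t in toks],
    keep_ratio=0.5,
    num_stages=2,
)
```

Halving the token count at each of two stages leaves a quarter of the original patches, which is where the reduction in attention cost comes from.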
