2021
DOI: 10.48550/arxiv.2112.07658
Preprint

A-ViT: Adaptive Tokens for Efficient Vision Transformer

Abstract: We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds. We reformulate Adaptive Computation Time (ACT [17]) for this task, extending halting to discard redundant spatial tokens. The appealing architectural properties of vision transformers enable our adaptive token reduction mechanism…
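The halting mechanism the abstract describes can be pictured with a short sketch: each layer emits a per-token halting score, scores accumulate across layers, and a token stops being processed once its cumulative score crosses a threshold, as in ACT. This is a minimal illustration under stated assumptions, not the authors' implementation; the module interfaces, the sigmoid halting head, and the 1 − eps threshold are assumptions for illustration (the paper derives halting scores without adding a separate module).

```python
import torch
import torch.nn as nn

def adaptive_token_halting(tokens, blocks, halt_heads, eps=0.01):
    """Sketch of ACT-style token halting for a ViT encoder.

    tokens:     (B, N, D) patch embeddings
    blocks:     transformer blocks, each mapping (B, N, D) -> (B, N, D)
    halt_heads: per-layer modules mapping (B, N, D) -> (B, N, 1) halting logits
    """
    B, N, _ = tokens.shape
    cumul = torch.zeros(B, N, device=tokens.device)   # cumulative halting score per token
    active = torch.ones(B, N, dtype=torch.bool, device=tokens.device)

    for block, head in zip(blocks, halt_heads):
        tokens = block(tokens)
        h = torch.sigmoid(head(tokens)).squeeze(-1)   # per-token halting probability
        cumul = cumul + h * active                    # only still-active tokens accumulate
        # a token halts once its cumulative score reaches 1 - eps, as in ACT
        active = active & (cumul < 1.0 - eps)
        # halted tokens are zeroed so later blocks effectively ignore them;
        # a real implementation would also mask them out of attention
        tokens = tokens * active.unsqueeze(-1)
    return tokens, cumul

# hypothetical usage with toy modules
D = 64
blocks = [nn.TransformerEncoderLayer(D, nhead=4, batch_first=True) for _ in range(4)]
heads = [nn.Linear(D, 1) for _ in range(4)]
out, scores = adaptive_token_halting(torch.randn(2, 16, D), blocks, heads)
```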

Cited by 2 publications (2 citation statements)
References 21 publications (33 reference statements)
“…One of the limitations of our algorithm is that it requires the batch-wise masking scheme (as in Section 3.5) to achieve the best efficiency. Although this limitation has little impact on MIM pre-training, it restricts the application of our method to a broader range of settings, e.g., training ViTs with token sparsification [53,68], which requires instance-wise sparsification. These applications are beyond the scope of this work and we leave them for future study.…”
Section: Discussion
confidence: 99%
“…Rao et al. [26] introduced a prediction module to score each patch and then pruned redundant patches hierarchically. Yin et al. [27] reduced the inference cost by automatically minimizing the number of tokens. Despite the strong results achieved by these approaches, they focused only on classification/recognition tasks and reduced computational complexity at the cost of minor performance degradation.…”
Section: Introduction
confidence: 99%
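The score-and-prune approach the statement attributes to Rao et al. [26] can be sketched as follows: a small prediction module scores each patch, and only the top-scoring fraction is kept for subsequent blocks. This is a hedged sketch, not the cited implementation; the function names, the linear scorer, and the hard top-k selection are assumptions (the cited work uses a differentiable relaxation during training).

```python
import torch
import torch.nn as nn

def score_and_prune(tokens, scorer, keep_ratio=0.7):
    """Sketch of prediction-module-based token pruning.

    tokens: (B, N, D) patch embeddings
    scorer: module mapping (B, N, D) -> (B, N, 1) keep scores
    """
    B, N, D = tokens.shape
    scores = scorer(tokens).squeeze(-1)            # (B, N) keep score per patch
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices            # indices of the k highest-scoring patches
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))

# hypothetical usage: drop ~30% of patches before the next block
scorer = nn.Linear(64, 1)
kept = score_and_prune(torch.randn(2, 16, 64), scorer, keep_ratio=0.7)  # (2, 11, 64)
```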