2021
DOI: 10.48550/arxiv.2112.00265
Preprint

Training BatchNorm Only in Neural Architecture Search and Beyond

Abstract: This work investigates the use of batch normalization in neural architecture search (NAS). Specifically, Frankle et al. [22] find that training BatchNorm only can achieve nontrivial performance. Furthermore, Chen et al. [9] claim that training BatchNorm only can speed up the training of the one-shot NAS supernet by over ten times. Critically, there has been no effort to understand 1) why training BatchNorm only can find well-performing architectures with reduced supernet-training time, and 2) what is the difference…
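The core idea, freezing all weights at their random initialization and updating only the BatchNorm affine parameters (gamma and beta), is easy to prototype. Below is a minimal PyTorch sketch, assuming a generic convolutional model as a stand-in for the one-shot supernet; the layer sizes and optimizer settings are illustrative and are not taken from the paper.

import torch
import torch.nn as nn

_BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

def freeze_all_but_batchnorm(model: nn.Module) -> None:
    """Freeze every parameter, then re-enable gradients only for the BatchNorm
    affine terms (gamma/beta), mirroring the 'train BatchNorm only' setup."""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        if isinstance(module, _BN_TYPES) and module.affine:
            module.weight.requires_grad = True  # gamma
            module.bias.requires_grad = True    # beta

# Hypothetical stand-in for a one-shot NAS supernet (architecture is illustrative).
supernet = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

freeze_all_but_batchnorm(supernet)

# Only the BatchNorm parameters reach the optimizer, so each update step
# touches a tiny fraction of the network's weights.
trainable = [p for p in supernet.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)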

Cited by 1 publication (1 citation statement)
References: 44 publications
“…However, computational efficiency is critical in real-world scenarios, where the executed computation is translated into power consumption or carbon emissions. Many works have tried to reduce the computational cost of CNNs via neural architecture search [10,16,25,54,57], knowledge distillation [20,55], dynamic routing [4,13,43,51,56] and pruning [15,18], but how to accelerate ViT models has rarely been explored.…”
Section: Model Compression
Citation type: mentioning
Confidence: 99%