2022
DOI: 10.48550/arxiv.2207.05501
Preprint
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Abstract: Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) can not perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: Can a visual neural network be designed to infer as fast as CNNs and perform as powerful as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is fa…

Cited by 32 publications (40 citation statements) | References 27 publications
“…The SR strategy in the PvT effectively reduced the amount of computations. In a recent report, inspired by the SR in the PvT, Li et al [ 48 ] proposed the Next-ViT, a new paradigm that fuses convolutional and Transformer modules during every stage, aiming to improve model efficiency and achieve industrial-scale deployment of the CNN-Transformer hybrid architecture. Therefore, how to achieve efficient computation in a multi-branch hybrid framework is also a problem that needs more research and experiments.…”
Section: Discussion
confidence: 99%
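The fusion paradigm described in the statement above — convolution and Transformer modules combined within every stage, rather than attention reserved for the final bottleneck — can be sketched as a block schedule. This is an abstract illustration only; the function name and per-stage block counts below are hypothetical, not Next-ViT's actual NCB/NTB configuration.

```python
# Toy sketch of a per-stage CNN-Transformer hybrid schedule (hypothetical
# names and counts, not Next-ViT's actual design): every stage closes with
# one attention block after a run of convolution blocks, instead of placing
# attention only in the last (bottleneck) stage.

def hybrid_stage_plan(num_stages=4, conv_blocks_per_stage=(2, 2, 6, 2)):
    """Return a list of per-stage block sequences, e.g. ['conv', 'conv', 'attn']."""
    plan = []
    for s in range(num_stages):
        blocks = ["conv"] * conv_blocks_per_stage[s] + ["attn"]
        plan.append(blocks)
    return plan

for i, stage in enumerate(hybrid_stage_plan(), 1):
    print(f"stage {i}: {' -> '.join(stage)}")
```

The point of the sketch is structural: attention appears in every stage's block list, which is what "fuses convolutional and Transformer modules during every stage" means in practice.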
“…The SR strategy in the PvT effectively reduced the amount of computations. In a recent report, inspired by the SR in the PvT, Li et al [ 48 ] proposed the Next-ViT, a new paradigm that fuses convolutional and Transformer modules during every stage, aiming to improve model efficiency and achieve industrial-scale deployment of the CNN-Transformer hybrid architecture. Therefore, how to achieve efficient computation in a multi-branch hybrid framework is also a problem that needs more research and experiments.…”
Section: Discussionmentioning
confidence: 99%
“…For example, CNNs are applied at large resolution stages, while ViT blocks serve as bottlenecks (Liang et al 2021, Dalmaz et al 2022). However, downscaled spatial extents in these configurations may compromise the long-range context relationships of MSA and lead to performance saturation in downstream tasks (Li et al 2022). Other approaches have adopted successive stacking of convolutional and MSA operations (Wu et al 2021), (Wang et al 2021).…”
Section: Hybrid CNN-Transformer Network
confidence: 99%
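The design trade-off in the statement above — CNNs at large-resolution stages, MSA only at the downscaled bottleneck — follows from how the two operations scale with spatial size. A rough FLOP comparison (illustrative channel widths and resolutions, not measured costs from any of the cited models) makes this concrete:

```python
# Rough FLOP comparison showing why hybrid designs push multi-head
# self-attention (MSA) to low-resolution stages: attention cost grows with
# the square of the token count (H*W), while a 3x3 convolution grows only
# linearly in H*W. Channel width 384 is an illustrative choice.

def attn_flops(h, w, d):
    n = h * w                    # number of tokens
    return 2 * n * n * d         # QK^T and attention-weighted V matmuls

def conv3x3_flops(h, w, c_in, c_out):
    return h * w * 9 * c_in * c_out

for h in (56, 28, 14, 7):        # typical stage resolutions
    a = attn_flops(h, h, 384)
    c = conv3x3_flops(h, h, 384, 384)
    print(f"{h:>2}x{h:<2}  attn={a / 1e9:7.2f} GFLOPs  conv3x3={c / 1e9:5.2f} GFLOPs")
```

At 56x56 attention is the costlier operation, while at 7x7 it is far cheaper than the convolution — which is exactly why early hybrids placed ViT blocks at the bottleneck, at the price of losing long-range context at high resolutions.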
“…But the computational complexity of self-attention is quadratic with respect to image size, so most existing ViTs cannot perform as efficiently as CNNs in realistic industrial deployment scenarios. To address this problem, Li et al 28 developed the Next-ViT, which stacks efficient convolution blocks and Transformer blocks in a novel strategy to build a powerful architecture for efficient deployment on both mobile devices and server graphic processing units.…”
Section: Introduction
confidence: 99%
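The quadratic scaling mentioned in the statement above, and the spatial-reduction (SR) remedy from PVT cited in the first statement, can both be shown with a back-of-the-envelope cost model (the resolutions, width, and reduction ratio below are illustrative, not any model's actual settings):

```python
# Illustration of why self-attention cost is quadratic in image size, and how
# spatial reduction (SR, as in PVT) tames it by shrinking the key/value set.
# All constants here are illustrative.

def attn_cost(h, w, d, sr_ratio=1):
    n_q = h * w                                 # queries: one per token
    n_kv = (h // sr_ratio) * (w // sr_ratio)    # keys/values after reduction
    return 2 * n_q * n_kv * d                   # two matmuls over the token grid

base = attn_cost(32, 32, 64)
doubled = attn_cost(64, 64, 64)
print("doubling the input side multiplies cost by", doubled // base)           # 16x
print("SR with ratio 4 cuts cost by", base // attn_cost(32, 32, 64, sr_ratio=4))
```

Doubling the image side quadruples the token count and hence multiplies the full attention cost by 16, while reducing keys/values by a ratio of 4 per side divides it by 16 — the arithmetic behind both the deployment problem and the SR-style fixes that hybrid architectures build on.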