Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Li, Jiashi; Xia, Xinhui; Li, Wei; Li, Huixia; Wang, Xing; Xiao, Xuefeng; Wang, Rui; Zheng, Min; Pan, Xin

doi:10.48550/arxiv.2207.05501

Cited by 32 publications

(40 citation statements)

References 27 publications

“…The SR strategy in the PvT effectively reduced the amount of computations. In a recent report, inspired by the SR in the PvT, Li et al [48] proposed the Next-ViT, a new paradigm that fuses convolutional and Transformer modules during every stage, aiming to improve model efficiency and achieve industrial-scale deployment of the CNN-Transformer hybrid architecture. Therefore, how to achieve efficient computation in a multi-branch hybrid framework is also a problem that needs more research and experiments.…”

Section: Discussion (mentioning)

confidence: 99%

PHF3 Technique: A Pyramid Hybrid Feature Fusion Framework for Severity Classification of Ulcerative Colitis Using Endoscopic Images

Ruan

Liu

et al. 2022

Bioengineering

Evaluating the severity of ulcerative colitis (UC) through the Mayo endoscopic subscore (MES) is crucial for understanding patient conditions and providing effective treatment. However, UC lesions present different characteristics in endoscopic images, exacerbating interclass similarities and intraclass differences in MES classification. In addition, inexperience and review fatigue in endoscopists introduces nontrivial challenges to the reliability and repeatability of MES evaluations. In this paper, we propose a pyramid hybrid feature fusion framework (PHF3) as an auxiliary diagnostic tool for clinical UC severity classification. Specifically, the PHF3 model has a dual-branch hybrid architecture with ResNet50 and a pyramid vision Transformer (PvT), where the local features extracted by ResNet50 represent the relationship between the intestinal wall at the near-shot point and its depth, and the global representations modeled by the PvT capture similar information in the cross-section of the intestinal cavity. Furthermore, a feature fusion module (FFM) is designed to combine local features with global representations, while second-order pooling (SOP) is applied to enhance discriminative information in the classification process. The experimental results show that, compared with existing methods, the proposed PHF3 model has competitive performance. The area under the receiver operating characteristic curve (AUC) of MES 0, MES 1, MES 2, and MES 3 reached 0.996, 0.972, 0.967, and 0.990, respectively, and the overall accuracy reached 88.91%. Thus, our proposed method is valuable for developing an auxiliary assessment system for UC severity.

“…For example, CNNs are applied at large resolution stages, while ViT blocks serve as bottlenecks (Liang et al 2021, Dalmaz et al 2022). However, downscaled spatial extents in these configurations may compromise the long-range context relationships of MSA and lead to performance saturation in downstream tasks (Li et al 2022). Other approaches have adopted successive stacking of convolutional and MSA operations (Wu et al 2021, Wang et al 2021).…”

Section: Hybrid CNN-Transformer Network (mentioning)

confidence: 99%

A unified hybrid transformer for joint MRI sequences super-resolution and missing data imputation

Wang

Hu

Yu

et al. 2023

Phys. Med. Biol.

Objective: High-resolution (HR) multi-modal magnetic resonance imaging (MRI) is crucial in clinical practice for accurate diagnosis and treatment. However, challenges such as budget constraints, potential contrast agent deposition, and image corruption often limit the acquisition of multiple sequences from a single patient. Therefore, the development of novel methods to reconstruct under-sampled images and synthesize missing sequences is crucial for clinical and research applications. Approach: In this paper, we propose a unified hybrid framework called SIFormer, which utilizes any available low-resolution (LR) MRI contrast configurations to complete super-resolution (SR) of poor-quality MR images and impute missing sequences simultaneously in one forward process. SIFormer consists of a hybrid generator and a convolution-based discriminator. The generator incorporates two key blocks. First, the dual branch attention (DBA) block combines the long-range dependency building capability of the transformer with the high-frequency local information capture capability of the convolutional neural network (CNN) in a channel-wise split manner. Second, we introduce a learnable gating adaptation multi-layer perception (GA MLP) in the feed-forward block to optimize information transmission efficiently. Main Results: Comparative evaluations against six state-of-the-art methods demonstrate that SIFormer achieves enhanced quantitative performance and produces more visually pleasing results for image SR and synthesis tasks across multiple datasets. Significance: Extensive experiments conducted on multi-center multi-contrast MRI datasets, including both healthy individuals and brain tumor patients, highlight the potential of our proposed method to serve as a valuable supplement to MRI sequence acquisition in clinical and research settings.

“…But the computational complexity of self-attention is quadratic with respect to image size, resulting in most existing ViTs cannot perform as efficiently as CNNs in realistic industrial deployment scenarios. To address this problem, Li et al [28] developed the Next-ViT that stacks efficient convolution block and transformer block in a novel strategy to build a powerful architecture for efficient deployment on both mobile devices and server graphic processing units.…”

Section: Introduction (mentioning)

confidence: 99%

Pixel-level detection of multiple pavement distresses and surface design features with ShuttleNetV2

Zhang

et al. 2023

Structural Health Monitoring

Concurrently detecting multiple objects of interest will yield massive time savings in processing and enable a more streamlined and unified detection system. The ShuttleNet is designed to repeat the encoding–decoding round freely or even endlessly, achieving prodigious successes in terms of simultaneous detection of multiple pavement distresses and surface design features on asphalt pavements. This paper proposes an efficient and improved architecture of ShuttleNet called ShuttleNetV2 for enhanced global modeling and retrieving fine details capabilities. The proposed ShuttleNetV2 represents two major modifications on the original ShuttleNet. On the one hand, the self-attention mechanism is purposefully introduced to capture long-range dependency. On the other hand, ShuttleNetV2 adopts various sampling scales to combine the characteristics of different receptive fields. The experimental results indicate that the recommended architectural variation of the proposed ShuttleNetV2 model yields a mean F-measure of 94.21% and a mean intersection-over-union of 0.8914 on 1500 pairs of testing images. The proposed ShuttleNetV2 outperforms ShuttleNet in detecting nearly all types of pavement patterns. In particular, ShuttleNetV2 efficaciously tackles the tangible limitations of ShuttleNet in detecting giant distresses. Moreover, the ShuttleNetV2 can process an image in roughly 78 ms using modern graphic processing unit devices, which has a promising potential in supporting the real-time detection of multiple pavement distresses and surface design features on asphalt pavements.
