2022
DOI: 10.1609/aaai.v36i3.20222

Width & Depth Pruning for Vision Transformers

Abstract: Transformer models have demonstrated promising potential and achieved excellent performance on a range of computer vision tasks. However, the huge computational cost of vision transformers hinders their deployment and application on edge devices. Recent works have proposed to find and remove the unimportant units of vision transformers. Despite achieving remarkable results, these methods only take the dimension of network width into consideration and ignore network depth, which is another important dimension …
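As a rough illustration of the two dimensions the abstract contrasts, the sketch below prunes a toy block list along width (hidden units inside a block, ranked here by a simple L1-norm score) and along depth (dropping whole blocks). The TinyBlock class, the L1 criterion, and the keep ratios are illustrative assumptions for this sketch, not the selection rule proposed in the paper.

```python
# A minimal, self-contained sketch (PyTorch) of the two pruning dimensions:
# "width" = units inside a block, "depth" = whole blocks.
# TinyBlock and the L1-norm score are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)   # the "width" lives in this hidden dimension
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(x)))  # residual keeps the outer shape fixed

def prune_width(block: TinyBlock, keep: int) -> TinyBlock:
    """Keep the `keep` hidden units with the largest L1 weight norm."""
    scores = block.fc1.weight.abs().sum(dim=1)               # one score per hidden unit
    idx = scores.topk(keep).indices
    new = TinyBlock(block.fc1.in_features, keep)
    new.fc1.weight.data = block.fc1.weight.data[idx].clone()
    new.fc1.bias.data = block.fc1.bias.data[idx].clone()
    new.fc2.weight.data = block.fc2.weight.data[:, idx].clone()
    new.fc2.bias.data = block.fc2.bias.data.clone()
    return new

def prune_depth(blocks: nn.ModuleList, keep_mask: list) -> nn.ModuleList:
    """Drop whole blocks flagged as unimportant; the residual form keeps shapes valid."""
    return nn.ModuleList(b for b, k in zip(blocks, keep_mask) if k)

blocks = nn.ModuleList(TinyBlock(64, 256) for _ in range(12))
blocks = nn.ModuleList(prune_width(b, keep=128) for b in blocks)      # width pruning
blocks = prune_depth(blocks, [i % 2 == 0 for i in range(12)])         # depth pruning
x = torch.randn(1, 16, 64)
for b in blocks:
    x = b(x)
print(x.shape)  # torch.Size([1, 16, 64]), now from 6 blocks of width 128
```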

Cited by 53 publications (23 citation statements)
References 26 publications
“…The first group of methods focuses on reducing the complexity of the attention module by imposing the locality of input images adaptively [36], [37]. The second group applies pruning methods to remove unimportant components (e.g., partial channels) [38] or inputs (i.e., patches) [39] to a ViT model. The third group of methods uses neural architecture search (NAS) techniques to design efficient ViT models by optimizing architectural hyperparameters, such as channels and depth [40], [41].…”
Section: Vision Transformers (mentioning, confidence: 99%)
“…Previous works [6,7,8,9] have addressed the redundancy of input image patches in vision transformers and have attempted to optimize patch slimming to enhance computation efficiency. Most of these approaches [7,8,9] have utilized the same paradigm, classifying image patches in each layer into two classes, one for keeping and another for discarding, using an attentive score.…”
Section: Self-Attention (mentioning, confidence: 99%)
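The keep/discard paradigm described in this excerpt can be sketched as follows: score each patch token by the attention it receives from the class token and retain the top-k. The slim_patches helper, the class-token scoring rule, and the tensor shapes are assumptions chosen for illustration; the cited works each define their own attentive score.

```python
# Hedged sketch (PyTorch) of attentive-score patch slimming: keep the patches that
# receive the most class-token attention, discard the rest. The scoring rule and
# the slim_patches helper are illustrative assumptions, not any one cited method.
import torch

def slim_patches(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """tokens: (B, 1 + N, C) with the class token first; cls_attn: (B, N) scores."""
    B, n_plus_1, C = tokens.shape
    n_keep = max(1, int((n_plus_1 - 1) * keep_ratio))
    idx = cls_attn.topk(n_keep, dim=1).indices                     # (B, n_keep) kept patches
    patches = tokens[:, 1:, :]                                     # drop the class token
    kept = patches.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))  # gather kept patches
    return torch.cat([tokens[:, :1, :], kept], dim=1)              # re-attach the class token

x = torch.randn(2, 1 + 196, 384)           # e.g. a DeiT-S-sized token sequence
attn = torch.rand(2, 196).softmax(dim=-1)  # stand-in for real class-token attention weights
print(slim_patches(x, attn, keep_ratio=0.5).shape)  # torch.Size([2, 99, 384])
```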
“…To compress and accelerate transformer models, a variety of techniques naturally emerge. Popular approaches include weight quantization [34], knowledge distillation [31], filter compression [27], and model pruning [24]. Among them, model pruning, especially structured pruning, has gained considerable interest: it removes the least important parameters from pre-trained models in a hardware-friendly manner, and is thus the focus of our paper.…”
Section: Introduction (mentioning, confidence: 99%)
“…(1) Criterion-based pruning resorts to preserving the most important weights/attentions by employing pre-defined criteria, e.g., the L1/L2 norm [23], or activation values [6]. (2) Training-based pruning retrains models with hand-crafted sparse regularizations [36] or resource constraints [31,32].…”
Section: Introduction (mentioning, confidence: 99%)
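The two families in this excerpt can be sketched on a single linear layer: a criterion-based mask built from the per-channel L1 norm, and a training-based sparsity penalty added to the loss so that unimportant channels can later be thresholded away. The layer sizes, the 50% keep ratio, and the sparse_penalty helper are hypothetical, not the procedures of the cited references.

```python
# Hedged sketch (PyTorch) of the two pruning families on one linear layer.
# The layer sizes, keep ratio, and penalty weight are hypothetical.
import torch
import torch.nn as nn

layer = nn.Linear(256, 512)

# (1) Criterion-based: rank output channels by their L1 norm and keep the top 50%.
scores = layer.weight.abs().sum(dim=1)        # one importance score per output channel
keep = scores.topk(k=256).indices             # channels preserved by the criterion
mask = torch.zeros(512, dtype=torch.bool)
mask[keep] = True                             # True = keep, False = prune

# (2) Training-based: an L1 sparsity regularizer added to the task loss so that
# unimportant weights are driven toward zero and can be removed after retraining.
def sparse_penalty(module: nn.Linear, lam: float = 1e-4) -> torch.Tensor:
    return lam * module.weight.abs().sum()

task_loss = torch.tensor(0.0)                 # placeholder for the real task loss
total_loss = task_loss + sparse_penalty(layer)
```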