2021
DOI: 10.48550/arxiv.2105.09121
Preprint
Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead

Abstract: Deploying deep learning models in time-critical applications with limited computational resources, for instance in edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early exiting. In this paper, we introduce a novel architecture for early exiting based on the vision transformer architecture, as well as a fine-tuning strategy that significantly increase the accuracy of early exit branches compared to conventional approaches while introducing le…
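As a rough illustration of the early-exiting mechanism the abstract refers to (not the paper's exact design), the sketch below attaches a classifier branch to an intermediate stage of a backbone and exits at inference time when the branch's softmax confidence passes a threshold. All module names, dimensions, and the threshold value are hypothetical.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Toy backbone with one intermediate exit branch (illustrative sketch only)."""
    def __init__(self, num_classes=10, dim=64, threshold=0.9):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(32, dim), nn.ReLU())   # early layers
        self.exit1 = nn.Linear(dim, num_classes)                     # early-exit head
        self.stage2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # remaining layers
        self.final_exit = nn.Linear(dim, num_classes)                # final classifier
        self.threshold = threshold

    def forward(self, x):
        h = self.stage1(x)
        early_logits = self.exit1(h)
        # At inference time, stop early if the branch is confident enough;
        # during training both exits would typically be optimized jointly.
        if not self.training:
            conf = early_logits.softmax(dim=-1).max(dim=-1).values
            if bool((conf >= self.threshold).all()):
                return early_logits
        return self.final_exit(self.stage2(h))
```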

Cited by 3 publications (4 citation statements)
References 26 publications
“…Note that while early exits have been recently attached to high-performing CNN backbones [10], [20], [21], there is no prior work for early exits on Vision Transformer backbones. Since the performance obtained by Vision Transformer backbones is improved by a large margin, we omit listing the comparison with early exits on CNN backbones in the following results.…”
Section: Results
confidence: 99%
“…The second architecture called ViT-EE is shown in Figure 3 (b). This is perhaps the most intuitive architecture where P b is given as input to a Transformer encoder layer [10]. The output of the Transformer encoder is then normalized and passed on to an MLP, similar to the previous architecture.…”
Section: Multi-exit Vision Transformer
confidence: 99%
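Based only on the description quoted above, a minimal sketch of such a ViT-EE early-exit head might look as follows: one Transformer encoder layer applied to the intermediate token sequence P_b, whose output is normalized and passed to an MLP classifier. The embedding size, pooling over tokens, and the exact form of P_b are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class ViTEEHead(nn.Module):
    """Early-exit head: Transformer encoder layer -> LayerNorm -> MLP classifier."""
    def __init__(self, embed_dim=384, num_heads=6, num_classes=1000):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.mlp = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):          # patch_tokens: (batch, tokens, embed_dim)
        z = self.encoder(patch_tokens)        # extra encoder layer applied to P_b
        z = self.norm(z)                      # normalize the encoder output
        cls_repr = z.mean(dim=1)              # pooling over tokens (assumed; could be a CLS token)
        return self.mlp(cls_repr)             # early-exit class logits
```

A head like this would be attached after an intermediate backbone block, taking that block's token sequence as input to produce early-exit predictions.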
“…Dynamic Inference with ViTs. To reduce the deployment costs of ViTs, several works (Wang et al 2021; Bakhtiarnia, Zhang, and Iosifidis 2021; Rao et al 2021; Meng et al 2022; Xu et al 2022; Uzkent, Yeh, and Ermon 2020; Uzkent and Ermon 2020) have been proposed to improve the processing speed via dynamically pruning the tokens/patches or skipping transformer components adaptively. Essentially as dynamic inference approaches, this set of works do not pursue to reduce the model sizes but focus on input-aware inference to obtain practical speedup.…”
Section: Introduction
confidence: 99%
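As a generic illustration of the input-adaptive token pruning mentioned in this statement (not any specific cited method), the sketch below keeps only the highest-scoring patch tokens; the scoring rule and keep ratio are assumptions.

```python
import torch

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep only the highest-scoring patch tokens (generic illustration).

    tokens: (batch, num_tokens, dim); scores: (batch, num_tokens) per-token
    importance, e.g. attention received by each token (assumed scoring rule).
    """
    num_keep = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(num_keep, dim=1).indices                   # (batch, num_keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))      # broadcast over feature dim
    return tokens.gather(dim=1, index=idx)                       # pruned token set
```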
“…a 63× reduction in time complexity for a one-block transformer) for shallow Transformers on common benchmarks in Online Action Detection [11] and Online Audio Classification [12]. Though our innovation is limited to one- and two-block Transformers, these have important uses in modern network design [13], [14]. We view these efficiency improvements as an important step towards widespread adoption of Transformers in real-time applications and on resource-constrained devices.…”
Section: Introduction
confidence: 99%