2021
DOI: 10.48550/arxiv.2105.09121
Preprint
Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead

Abstract: Deploying deep learning models in time-critical applications with limited computational resources, for instance in edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early exiting. In this paper, we introduce a novel architecture for early exiting based on the vision transformer architecture, as well as a fine-tuning strategy that significantly increase the accuracy of early exit branches compared to conventional approaches while introducing le…
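As a rough illustration of the early-exiting mechanism the abstract refers to (not the paper's exact design), the sketch below attaches a classifier branch to an intermediate stage of a backbone and exits at inference time when the branch's softmax confidence passes a threshold. All module names, dimensions, and the threshold value are hypothetical.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Toy backbone with one intermediate exit branch (illustrative sketch only)."""
    def __init__(self, num_classes=10, dim=64, threshold=0.9):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(32, dim), nn.ReLU())   # early layers
        self.exit1 = nn.Linear(dim, num_classes)                     # early-exit head
        self.stage2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # remaining layers
        self.final_exit = nn.Linear(dim, num_classes)                # final classifier
        self.threshold = threshold

    def forward(self, x):
        h = self.stage1(x)
        early_logits = self.exit1(h)
        # At inference time, stop early if the branch is confident enough;
        # during training both exits would typically be optimized jointly.
        if not self.training:
            conf = early_logits.softmax(dim=-1).max(dim=-1).values
            if bool((conf >= self.threshold).all()):
                return early_logits
        return self.final_exit(self.stage2(h))
```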

Cited by 3 publications (4 citation statements)
References 26 publications
“…Note that while early exits have been recently attached to high-performing CNN backbones [10], [20], [21], there is no prior work for early exits on Vision Transformer backbones. Since the performance obtained by Vision Transformer backbones is improved by a large margin, we omit listing the comparison with early exits on CNN backbones in the following results.…”
Section: Results
confidence: 99%
“…The second architecture called ViT-EE is shown in Figure 3 (b). This is perhaps the most intuitive architecture where P b is given as input to a Transformer encoder layer [10]. The output of the Transformer encoder is then normalized and passed on to an MLP, similar to the previous architecture.…”
Section: Multi-exit Vision Transformer
confidence: 99%
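Based only on the description quoted above, a minimal sketch of such a ViT-EE early-exit head might look as follows: one Transformer encoder layer applied to the intermediate token sequence P_b, whose output is normalized and passed to an MLP classifier. The embedding size, pooling over tokens, and the exact form of P_b are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class ViTEEHead(nn.Module):
    """Early-exit head: Transformer encoder layer -> LayerNorm -> MLP classifier."""
    def __init__(self, embed_dim=384, num_heads=6, num_classes=1000):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.mlp = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):          # patch_tokens: (batch, tokens, embed_dim)
        z = self.encoder(patch_tokens)        # extra encoder layer applied to P_b
        z = self.norm(z)                      # normalize the encoder output
        cls_repr = z.mean(dim=1)              # pooling over tokens (assumed; could be a CLS token)
        return self.mlp(cls_repr)             # early-exit class logits
```

A head like this would be attached after an intermediate backbone block, taking that block's token sequence as input to produce early-exit predictions.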
“…Dynamic Inference with ViTs. To reduce the deployment costs of ViTs, several works (Wang et al 2021; Bakhtiarnia, Zhang, and Iosifidis 2021; Rao et al 2021; Meng et al 2022; Xu et al 2022; Uzkent, Yeh, and Ermon 2020; Uzkent and Ermon 2020) have been proposed to improve the processing speed via dynamically pruning the tokens/patches or skipping transformer components adaptively. Essentially as dynamic inference approaches, this set of works do not pursue to reduce the model sizes but focus on input-aware inference to obtain practical speedup.…”
Section: Introduction
confidence: 99%
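As a generic illustration of the input-adaptive token pruning mentioned in this statement (not any specific cited method), the sketch below keeps only the highest-scoring patch tokens; the scoring rule and keep ratio are assumptions.

```python
import torch

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep only the highest-scoring patch tokens (generic illustration).

    tokens: (batch, num_tokens, dim); scores: (batch, num_tokens) per-token
    importance, e.g. attention received by each token (assumed scoring rule).
    """
    num_keep = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(num_keep, dim=1).indices                   # (batch, num_keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))      # broadcast over feature dim
    return tokens.gather(dim=1, index=idx)                       # pruned token set
```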
“…a 63× reduction in time complexity for a one-block transformer) for shallow Transformers on common benchmarks in Online Action Detection [11] and Online Audio Classification [12]. Though our innovation is limited to one- and two-block Transformers, these have important uses in modern network design [13], [14]. We view these efficiency improvements as an important step towards widespread adoption of Transformers in real-time applications and on resource-constrained devices.…”
Section: Introduction
confidence: 99%