EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

Tambe, Thierry; Hooper, Coleman; Pentecost, Lillian; Jia, Tianyu; Yang, En-Yu; Donato, Marco; Sanh, Victor; Whatmough, Paul N.; Rush, Alexander M.; Brooks, David; Wei, Gu-Yeon

doi:10.48550/arxiv.2011.14203

Cited by 2 publications

(1 citation statement)

References 65 publications

“…Other great ways of improving the efficiency of transformers include weight sharing across transformer blocks (Lan et al, 2019), dynamically controlling the attention span of each token (Tambe et al, 2020), and allowing the model to output the result in an earlier transformer block (Zhou et al, 2020; Schwartz et al, 2020). These techniques are orthogonal to our pruning-based method and have remained unexplored on vision models.…”

Section: ViT Compression Techniques (mentioning)

confidence: 99%

NViT: Vision Transformer Compression and Parameter Redistribution

Yang, Yin, Shen, et al. 2021

Preprint

Transformers yield state-of-the-art results across many tasks. However, they impose huge computational costs during inference. We apply global structural pruning with latency-aware regularization on all parameters of the Vision Transformer (ViT) model for latency reduction. Furthermore, we analyze the pruned architectures and find interesting regularities in the final weight structure. Our discovered insights lead to a new architecture called NViT (Novel ViT), with a redistribution of where parameters are used. This architecture utilizes parameters more efficiently and enables control of the latency-accuracy trade-off. On ImageNet-1K, we prune the DEIT-Base model to a 2.6× FLOPs reduction, 5.1× parameter reduction, and 1.9× run-time speedup with merely 0.07% loss in accuracy. We achieve more than 1% accuracy gain when compressing the base model to the throughput of the Small/Tiny variants. NViT gains 0.1-1.1% accuracy over the hand-designed DEIT family when trained from scratch, while being faster.

* Work done during an internship at NVIDIA.
