2021
DOI: 10.48550/arxiv.2109.12948
Preprint

Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Abstract: Transformer-based architectures have become the de facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges, namely high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format. We establish that these activ…
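To make the dynamic-range issue from the abstract concrete, here is a minimal NumPy sketch (not from the paper; the tensor sizes and outlier value are illustrative) showing how a single activation outlier inflates the per-tensor scale of symmetric INT8 quantization and degrades accuracy for all other values.

```python
import numpy as np

def quantize_int8_symmetric(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: one scale derived from max |x|."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096).astype(np.float32)   # "typical" activations
acts_with_outlier = acts.copy()
acts_with_outlier[0] = 60.0                                  # a single large outlier

for name, x in [("no outlier", acts), ("with outlier", acts_with_outlier)]:
    q, s = quantize_int8_symmetric(x)
    err = np.abs(dequantize(q, s) - x).mean()
    print(f"{name:12s} scale={s:.4f} mean abs error={err:.4f}")
# The outlier inflates the per-tensor scale, so the quantization error on the
# remaining values grows by roughly an order of magnitude.
```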


Cited by 7 publications (12 citation statements)
References 39 publications
“…Particularly, as ZeroQuant has no activation range calibration phase, the cost of ZeroQuant is zero, which is even cheaper than standard PTQ. Compared to [6], our method achieves a better average score (1.29 higher). Meanwhile, whereas ZeroQuant uses INT8 activations, [6] uses mixed INT8 and FP16 activations.…”
Section: Main Results of BERT (mentioning)
confidence: 86%
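The "activation range calibration phase" referred to above is, in common PTQ pipelines, a pass that records activation statistics over a handful of batches and derives quantization parameters from them. A minimal sketch of such an observer follows; the class, batch count, and shapes are illustrative assumptions, not taken from either paper.

```python
import numpy as np

class MinMaxObserver:
    """Tracks running min/max of activations over calibration batches,
    then derives an asymmetric 8-bit scale and zero-point."""
    def __init__(self):
        self.min_val = np.inf
        self.max_val = -np.inf

    def observe(self, x: np.ndarray) -> None:
        self.min_val = min(self.min_val, float(x.min()))
        self.max_val = max(self.max_val, float(x.max()))

    def qparams(self):
        scale = (self.max_val - self.min_val) / 255.0
        zero_point = int(round(-self.min_val / scale))
        return scale, zero_point

# Hypothetical calibration loop: push a few batches through the model and
# record the activation range at each tensor to be quantized.
observer = MinMaxObserver()
rng = np.random.default_rng(0)
for _ in range(8):                                   # 8 calibration batches (illustrative)
    activations = rng.normal(0.0, 2.0, size=(32, 768)).astype(np.float32)
    observer.observe(activations)
print(observer.qparams())
```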
“…As such, it is hard to obtain a real inference-latency benefit on general compute accelerators, e.g., CPUs and GPUs, because the parallel processing units in this hardware do not support efficient computation with mixed data types. More recently, [6] introduced high-precision (FP16) activation quantization for part of the model to overcome the high dynamic activation ranges. However, to the best of our knowledge, (1) how to apply PTQ to GPT-3-style models while achieving high accuracy has not been studied in any previous work; (2) how to apply PTQ to models at the billion (or even tens-of-billions) parameter scale is still under-explored; (3) an efficient inference system backend is still missing, especially for fine-grained quantization schemes, making it hard to achieve low latency on commodity hardware.…”
Section: Related Work (mentioning)
confidence: 99%
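Below is a hedged sketch of the "mixed INT8 and FP16 activation" idea attributed to [6]: activations of layers whose ranges are too wide for INT8 are simply kept in FP16. The layer names and the error comparison are illustrative assumptions, and the sketch only simulates quantization numerically, so it says nothing about hardware support for mixed data types.

```python
import numpy as np

def fake_quant_int8(x: np.ndarray) -> np.ndarray:
    """Simulated symmetric per-tensor INT8 quantize/dequantize."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -128, 127) * scale

def quantize_activation(name: str, x: np.ndarray, fp16_layers: set) -> np.ndarray:
    """Keep selected layers' activations in FP16, fake-quantize the rest to INT8."""
    if name in fp16_layers:
        return x.astype(np.float16).astype(np.float32)
    return fake_quant_int8(x)

# Hypothetical layer names; in practice these would be the tensors with the
# widest dynamic ranges (e.g. residual-sum outputs).
fp16_layers = {"ffn.residual_add"}

rng = np.random.default_rng(1)
wide = rng.normal(0.0, 1.0, size=(4, 768)).astype(np.float32)
wide[0, 0] = 80.0                                    # outlier -> wide dynamic range

for name in ["ffn.residual_add", "attention.output"]:
    y = quantize_activation(name, wide, fp16_layers)
    print(name, "mean abs error:", np.abs(y - wide).mean())
```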
“…Memory consumption is one of the main dimensions of efficiency, and it has been used as a cost indicator in various research works (Bondarenko et al., 2021; Dosovitskiy et al., 2020; Kondratyuk et al., 2021). Memory footprint is often reported as "peak memory usage" during training, which takes into account the memory consumed by the model, the optimizer, and the pipeline.…”
Section: Number of Parameters (mentioning)
confidence: 99%
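As a concrete example of the "peak memory usage" reporting described above, here is a small PyTorch sketch (assuming a CUDA device is available; the model, batch, and optimizer are placeholders) that measures the peak allocated memory across one training step, covering model weights, activations, gradients, and optimizer state.

```python
import torch

# Tiny placeholder model and optimizer; any training setup works the same way.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

torch.cuda.reset_peak_memory_stats()                 # start a fresh peak counter
x = torch.randn(64, 1024, device="cuda")
loss = model(x).pow(2).mean()                        # forward: model + activations
loss.backward()                                      # backward: gradients
optimizer.step()                                     # optimizer state

peak_bytes = torch.cuda.max_memory_allocated()
print(f"peak memory: {peak_bytes / 2**20:.1f} MiB")
```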