2021
DOI: 10.48550/arxiv.2109.12948
Preprint

Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Abstract: Transformer-based architectures have become the de facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges, namely high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format. We establish that these activ…
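To make the dynamic-range issue from the abstract concrete, here is a minimal NumPy sketch (not from the paper; the tensor sizes and outlier value are illustrative) showing how a single activation outlier inflates the per-tensor scale of symmetric INT8 quantization and degrades accuracy for all other values.

```python
import numpy as np

def quantize_int8_symmetric(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: one scale derived from max |x|."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096).astype(np.float32)   # "typical" activations
acts_with_outlier = acts.copy()
acts_with_outlier[0] = 60.0                                  # a single large outlier

for name, x in [("no outlier", acts), ("with outlier", acts_with_outlier)]:
    q, s = quantize_int8_symmetric(x)
    err = np.abs(dequantize(q, s) - x).mean()
    print(f"{name:12s} scale={s:.4f} mean abs error={err:.4f}")
# The outlier inflates the per-tensor scale, so the quantization error on the
# remaining values grows by roughly an order of magnitude.
```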


Cited by 7 publications (12 citation statements)
References 39 publications
“…Particularly, as ZeroQuant has no activation range calibration phase, the cost of ZeroQuant is zero, which is even cheaper than standard PTQ. Compared to [6], our method achieves a better average score (1.29 higher). Meanwhile, whereas ZeroQuant uses INT8 activations, [6] uses mixed INT8 and FP16 activations.…”
Section: Main Results of BERT (mentioning)
confidence: 86%
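The "activation range calibration phase" referred to above is, in common PTQ pipelines, a pass that records activation statistics over a handful of batches and derives quantization parameters from them. A minimal sketch of such an observer follows; the class, batch count, and shapes are illustrative assumptions, not taken from either paper.

```python
import numpy as np

class MinMaxObserver:
    """Tracks running min/max of activations over calibration batches,
    then derives an asymmetric 8-bit scale and zero-point."""
    def __init__(self):
        self.min_val = np.inf
        self.max_val = -np.inf

    def observe(self, x: np.ndarray) -> None:
        self.min_val = min(self.min_val, float(x.min()))
        self.max_val = max(self.max_val, float(x.max()))

    def qparams(self):
        scale = (self.max_val - self.min_val) / 255.0
        zero_point = int(round(-self.min_val / scale))
        return scale, zero_point

# Hypothetical calibration loop: push a few batches through the model and
# record the activation range at each tensor to be quantized.
observer = MinMaxObserver()
rng = np.random.default_rng(0)
for _ in range(8):                                   # 8 calibration batches (illustrative)
    activations = rng.normal(0.0, 2.0, size=(32, 768)).astype(np.float32)
    observer.observe(activations)
print(observer.qparams())
```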
“…As such, it is hard to obtain a real inference-latency benefit on general compute accelerators, e.g., CPUs and GPUs, because the parallel processing units in this hardware do not support efficient computation with mixed data types. More recently, [6] introduced high-precision (FP16) activation quantization for part of the model to overcome the high dynamic activation ranges. However, to the best of our knowledge, (1) how to apply PTQ to GPT-3-style models while achieving high accuracy has not been studied in any previous work; (2) how to apply PTQ to models at the billion (or even tens-of-billions) parameter scale is still under-explored; (3) an efficient inference system backend is still missing, especially for fine-grained quantization schemes, making it hard to achieve low latency on commodity hardware.…”
Section: Related Work (mentioning)
confidence: 99%
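Below is a hedged sketch of the "mixed INT8 and FP16 activation" idea attributed to [6]: activations of layers whose ranges are too wide for INT8 are simply kept in FP16. The layer names and the error comparison are illustrative assumptions, and the sketch only simulates quantization numerically, so it says nothing about hardware support for mixed data types.

```python
import numpy as np

def fake_quant_int8(x: np.ndarray) -> np.ndarray:
    """Simulated symmetric per-tensor INT8 quantize/dequantize."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -128, 127) * scale

def quantize_activation(name: str, x: np.ndarray, fp16_layers: set) -> np.ndarray:
    """Keep selected layers' activations in FP16, fake-quantize the rest to INT8."""
    if name in fp16_layers:
        return x.astype(np.float16).astype(np.float32)
    return fake_quant_int8(x)

# Hypothetical layer names; in practice these would be the tensors with the
# widest dynamic ranges (e.g. residual-sum outputs).
fp16_layers = {"ffn.residual_add"}

rng = np.random.default_rng(1)
wide = rng.normal(0.0, 1.0, size=(4, 768)).astype(np.float32)
wide[0, 0] = 80.0                                    # outlier -> wide dynamic range

for name in ["ffn.residual_add", "attention.output"]:
    y = quantize_activation(name, wide, fp16_layers)
    print(name, "mean abs error:", np.abs(y - wide).mean())
```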
“…Memory consumption is one of the main dimensions of efficiency, and it has been used as a cost indicator in various research works (Bondarenko et al., 2021; Dosovitskiy et al., 2020; Kondratyuk et al., 2021). Memory footprint is often reported as "peak memory usage" during training, which takes into account the memory consumed by the model, the optimizer, and the pipeline.…”
Section: Number of Parameters (mentioning)
confidence: 99%
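As a concrete example of the "peak memory usage" reporting described above, here is a small PyTorch sketch (assuming a CUDA device is available; the model, batch, and optimizer are placeholders) that measures the peak allocated memory across one training step, covering model weights, activations, gradients, and optimizer state.

```python
import torch

# Tiny placeholder model and optimizer; any training setup works the same way.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

torch.cuda.reset_peak_memory_stats()                 # start a fresh peak counter
x = torch.randn(64, 1024, device="cuda")
loss = model(x).pow(2).mean()                        # forward: model + activations
loss.backward()                                      # backward: gradients
optimizer.step()                                     # optimizer state

peak_bytes = torch.cuda.max_memory_allocated()
print(f"peak memory: {peak_bytes / 2**20:.1f} MiB")
```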