Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Bondarenko, Yelysei; Nagel, Markus; Blankevoort, Tijmen

doi:10.18653/v1/2021.emnlp-main.627

Cited by 30 publications

(19 citation statements)

References 29 publications


“…This can be attributed to the larger range differences among different channels in large networks compared to smaller networks. As depicted in Table 5, we compared RPTQ with PEG [3]. Due to the original paper was only tested on small models, we applied its method to the OPT model and used the same group settings for PEG as in RPTQ.…”

Section: A2 Comparing With Other Methods (mentioning)

confidence: 99%

“…However, PTQ-SL mainly focuses on the quantization of weights in convolutional networks, and does not address the quantization issues of activations. PGQ [3] employs a range-based permutation of the embedding dimensions and share quantization parameters among elements in the same group to address the problem of activation quantization. Nonetheless, it only consider for the dynamic range and utilizes uniformly divided groups, rendering it less efficacious for LLMs.…”

Section: Quantization (mentioning)

confidence: 99%
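The grouping idea in the statement above (permute embedding dimensions by dynamic range, then share quantization parameters within each group) can be illustrated with a minimal sketch. This is not the cited authors' implementation: the function names, the min/max range estimate, the equal-sized groups, and the 8-bit asymmetric quantizer are all assumptions chosen for illustration.

```python
def quant_params(lo, hi, bits=8):
    """Asymmetric uniform quantization parameters for the interval [lo, hi]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # keep zero exactly representable
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0         # fall back to 1.0 for an all-zero range
    zero_point = round(-lo / scale)
    return scale, zero_point


def fake_quant(x, scale, zero_point, bits=8):
    """Quantize x to an unsigned integer grid, then map it back to a float."""
    q = round(x / scale) + zero_point
    q = max(0, min(2 ** bits - 1, q))         # clamp to the representable range
    return (q - zero_point) * scale


def per_group_fake_quant(acts, num_groups, bits=8):
    """acts: list of token rows, each a list of embedding values.

    Dimensions are sorted by dynamic range (the permutation), split into
    equal-sized groups, and each group shares one scale and zero point.
    """
    dims = len(acts[0])
    col_min = [min(row[d] for row in acts) for d in range(dims)]
    col_max = [max(row[d] for row in acts) for d in range(dims)]
    order = sorted(range(dims), key=lambda d: col_max[d] - col_min[d])
    size = -(-dims // num_groups)             # ceiling division
    out = [[0.0] * dims for _ in acts]
    for start in range(0, dims, size):
        group = order[start:start + size]
        lo = min(col_min[d] for d in group)
        hi = max(col_max[d] for d in group)
        scale, zp = quant_params(lo, hi, bits)
        for r, row in enumerate(acts):
            for d in group:
                out[r][d] = fake_quant(row[d], scale, zp, bits)
    return out


# Toy activations: the third dimension is an outlier with a much larger range.
acts = [[0.1, -0.2, 50.0, 0.3], [-0.1, 0.2, -60.0, -0.3]]
print(per_group_fake_quant(acts, num_groups=1))  # one shared range (per-tensor)
print(per_group_fake_quant(acts, num_groups=2))  # outlier isolated in its own group
```

With a single group, the outlier dimension sets the scale for every dimension, so small values collapse onto a coarse grid; with two groups, the low-range dimensions get their own finer scale.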

“…We will evaluate our proposed reorder-based post-training quantization (RPTQ) on large language models. As our work focus on processing the problem in quantizing activations, we use GPTQ [12] to quantize the weights in LLMs 3 . We apply static quantization to all the weights and input activations.…”

Section: Settings (mentioning)

confidence: 99%


WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models

Yuan

Zhao

et al. 2021

AI Open



“…3) Quantization. This technique uses fewer bits to represent the weights of parameterized functions [17,27].…”

Section: B Model Compression (mentioning)

confidence: 99%

Greener yet Powerful: Taming Large Code Generation Models with Quantization

Wei

Gonugondla

Ahmad

et al. 2023

Preprint


ML-powered code generation aims to assist developers to write code in a more productive manner, by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have substantially pushed the boundary of code generation and achieved impressive performance. Despite their great power, the huge number of model parameters poses a significant threat to adapting them in a regular software development environment, where a developer might use a standard laptop or mid-size server to develop her code. Such large models incur significant resource usage (in terms of memory, latency, and dollars) as well as carbon footprint. Model compression is a promising approach to address these challenges. Several techniques are proposed to compress large pretrained models typically used for vision or textual data. Out of many available compression techniques, we identified that quantization is mostly applicable for code generation task as it does not require significant retraining cost. As quantization represents model parameters with lower-bit integer (e.g., int8), the model size and runtime latency would both benefit from such int representation. We extensively study the impact of quantized model on code generation tasks across different dimension: (i) resource usage and carbon footprint, (ii) accuracy, and (iii) robustness. To this end, through systematic experiments we find a recipe of quantization technique that could run even a 6B model in a regular laptop without significant accuracy or robustness degradation. We further found the recipe is readily applicable to code summarization task as well.
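As a rough illustration of why int8 storage helps a 6B-parameter model fit on a laptop, the sketch below estimates weight memory at different bit widths. The helper name `weight_memory_gb` is hypothetical, and the estimate counts weights only: it ignores activations, quantization scales and zero points, and runtime overhead, and uses 1 GB = 10^9 bytes.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 6_000_000_000  # the 6B model size mentioned in the abstract
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8)]:
    print(f"{name}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions, moving from 32-bit floats to 8-bit integers cuts weight storage by a factor of four.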


“…Transformer quantization Compared to the CNNs, transformers with attention layers are naturally more challenging to quantize (Bondarenko et al, 2021). Previous research mainly focused on 8-bit quantization (Zafrir et al, 2019; or 4-bit quantization (Shen et al, 2020;Zadeh et al, 2020).…”

Section: Related Work (mentioning)

confidence: 99%

BiT: Robustly Binarized Multi-distilled Transformer

Liu

Oğuz

Pappu

et al. 2022

Preprint


Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues, however is technically challenging from an optimization perspective. In this work, we identify a series of improvements which enables binary transformers at a much higher accuracy than what was possible previously. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher precision models into lower precision students. These approaches allow for the first time, fully binarized transformer models that are at a practical level of accuracy, approaching a full-precision BERT baseline on the GLUE language understanding benchmark within as little as 5.9%.
