Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.627

Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Abstract: Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges, namely high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format. We establish that these activations…
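The core difficulty named in the abstract, that a few outlier channels inflate the activation range and leave little resolution for typical values, can be illustrated numerically. The snippet below is a minimal sketch, not the paper's code: it uses synthetic activations and a generic asymmetric uniform quantizer (NumPy only) to show how per-tensor fixed-point quantization degrades as the bit-width drops.

```python
# Minimal sketch (not the paper's code) of why a high dynamic activation
# range is hard to capture with a low-bit fixed-point format.
import numpy as np

def quantize_dequantize(x, num_bits=8):
    """Asymmetric uniform quantization with a single per-tensor scale."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
# Most activations are small, but a few channels carry large outliers,
# loosely mimicking the residual-stream outliers the paper describes.
acts = rng.normal(0.0, 1.0, size=(512, 768))
acts[:, :8] += 60.0  # outlier channels inflate the per-tensor range

for bits in (8, 6, 4):
    err = np.abs(quantize_dequantize(acts, bits) - acts)
    # The outliers dictate the scale, so the bulk of small activations is
    # rounded coarsely and the error grows quickly at low bit-widths.
    print(f"{bits}-bit per-tensor quantization: mean |error| = {err.mean():.3f}")
```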

Cited by 30 publications (19 citation statements)
References 29 publications
“…This can be attributed to the larger range differences among different channels in large networks compared to smaller networks. As depicted in Table 5, we compared RPTQ with PEG [3]. Because the original paper tested only on small models, we applied its method to the OPT model and used the same group settings for PEG as in RPTQ.…”
Section: A2 Comparing With Other Methods (mentioning)
confidence: 99%
“…However, PTQ-SL mainly focuses on the quantization of weights in convolutional networks and does not address the quantization of activations. PGQ [3] employs a range-based permutation of the embedding dimensions and shares quantization parameters among elements in the same group to address the problem of activation quantization. Nonetheless, it only considers the dynamic range and uses uniformly divided groups, rendering it less effective for LLMs.…”
Section: Quantization (mentioning)
confidence: 99%
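To make the quoted description concrete, the following is a rough sketch, under assumptions of my own rather than the implementation of [3]: it permutes embedding dimensions by their dynamic range and gives each equally sized group its own quantization parameters, using synthetic activations and NumPy.

```python
# Rough sketch (assumed, not the implementation of [3]): range-based
# permutation of embedding dimensions plus per-group quantization parameters.
import numpy as np

def per_group_quantize(acts, num_groups=6, num_bits=8):
    """Sort embedding dims by dynamic range, then quantize each group
    of dims with its own scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    ranges = acts.max(axis=0) - acts.min(axis=0)
    perm = np.argsort(ranges)            # range-based permutation
    out = np.empty_like(acts)
    for group in np.array_split(perm, num_groups):
        g = acts[:, group]
        scale = (g.max() - g.min()) / (qmax - qmin) + 1e-12
        zp = np.round(qmin - g.min() / scale)
        q = np.clip(np.round(g / scale + zp), qmin, qmax)
        out[:, group] = (q - zp) * scale  # dequantized, written back in place
    return out

rng = np.random.default_rng(1)
acts = rng.normal(0.0, 1.0, size=(256, 768))
acts[:, :8] *= 60.0  # a few embedding dims with a much larger dynamic range
err = np.abs(per_group_quantize(acts) - acts).mean()
print(f"per-group 8-bit quantization: mean |error| = {err:.4f}")
```

Because the permutation clusters dimensions of similar range, the wide-range dimensions end up sharing a group and no longer inflate the quantization scale used for the rest of the tensor.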
“…3) Quantization. This technique uses fewer bits to represent the weights of parameterized functions [17,27].…”
Section: B Model Compression (mentioning)
confidence: 99%
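As a concrete reading of "fewer bits to represent the weights", here is a minimal, generic sketch (not taken from [17] or [27]): a symmetric uniform quantizer that stores a float32 weight matrix as int8 values plus a single float scale.

```python
# Minimal, generic illustration of representing weights with fewer bits.
import numpy as np

def quantize_weights(w, num_bits=8):
    """Map float weights to signed integers plus one float scale."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_int = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return w_int, scale  # num_bits per weight instead of 32

w = np.random.default_rng(2).normal(0.0, 0.05, (768, 768)).astype(np.float32)
w_int, scale = quantize_weights(w)
print("max reconstruction error:", np.abs(w_int * scale - w).max())
```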
“…Transformer quantization: Compared to CNNs, transformers with attention layers are naturally more challenging to quantize (Bondarenko et al., 2021). Previous research mainly focused on 8-bit quantization (Zafrir et al., 2019) or 4-bit quantization (Shen et al., 2020; Zadeh et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%