2020
DOI: 10.1609/aaai.v34i05.6409
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Abstract: Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian informa…

Cited by 346 publications (315 citation statements)
References 8 publications
“…Quantization. Very recent work has shown that large models of language can be made more compact by applying quantization techniques (Han et al., 2016): e.g., quantized versions of Transformer-based machine translation systems (Bhandare et al., 2019) and BERT (Shen et al., 2019; Zhao et al., 2019a; Zafrir et al., 2019) are now available. In this work, we focus on enabling quantization-aware conversational pretraining on the response selection task.…”
Section: More Compact Response Selection Model (mentioning)
confidence: 99%
“…Quantization is another popular method to decrease model size, which reduces the numerical precision of the model's weights, and therefore both speeds up numerical operations and reduces model size (Wróbel et al., 2018; Shen et al., 2019; Zafrir et al., 2019).…”
Section: Related Work (mentioning)
confidence: 99%
“…reducing the memory footprint of a model by representing its weights by lower-precision values. This method, especially effective when used with specific hardware, has been recently applied by Shen et al. (2020) to the transformer architecture.…”
Section: Related Work (mentioning)
confidence: 99%
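The citation statements above describe weight quantization only in general terms. As a concrete illustration, here is a minimal sketch assuming plain symmetric uniform quantization with a single per-tensor scale; the function names are illustrative, and this is not Q-BERT's Hessian-guided, group-wise mixed-precision scheme.

```python
# Illustrative sketch only: symmetric uniform quantization of a weight matrix
# to signed 8-bit integers plus one float scale. This is a generic baseline,
# not the method proposed in the Q-BERT paper.
import numpy as np

def quantize_symmetric(weights: np.ndarray, num_bits: int = 8):
    """Map float weights to signed integers (num_bits <= 8 for the int8 container)."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    scale = float(np.max(np.abs(weights))) / qmax   # one scale per tensor (assumption)
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

# Usage: a float32 matrix stored as int8 uses roughly 4x less memory.
w = np.random.randn(768, 768).astype(np.float32)
q, s = quantize_symmetric(w, num_bits=8)
w_hat = dequantize(q, s)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Stored as int8 with one float scale, the matrix above occupies about a quarter of the memory of its float32 original, which is the footprint reduction the citing works refer to; the speed-up additionally depends on hardware with efficient integer kernels.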