2020
DOI: 10.1609/aaai.v34i05.6409
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Abstract: Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian informa…

Cited by 346 publications (315 citation statements)
References 8 publications
“…Quantization. Very recent work has shown that large models of language can be made more compact by applying quantization techniques (Han et al., 2016): e.g., quantized versions of Transformer-based machine translation systems (Bhandare et al., 2019) and BERT (Shen et al., 2019; Zhao et al., 2019a; Zafrir et al., 2019) are now available. In this work, we focus on enabling quantization-aware conversational pretraining on the response selection task.…”
Section: More Compact Response Selection Model (mentioning)
confidence: 99%
“…Quantization is another popular method to decrease model size, which reduces the numerical precision of the model's weights, and therefore both speeds up numerical operations and reduces model size (Wróbel et al., 2018; Shen et al., 2019; Zafrir et al., 2019).…”
Section: Related Work (mentioning)
confidence: 99%
“…reducing the memory footprint of a model by representing its weights by lower-precision values. This method, especially effective when used with specific hardware, has been recently applied by Shen et al. (2020) to the transformer architecture.…”
Section: Related Work (mentioning)
confidence: 99%
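The citation statements above describe weight quantization only in general terms. As a concrete illustration, here is a minimal sketch assuming plain symmetric uniform quantization with a single per-tensor scale; the function names are illustrative, and this is not Q-BERT's Hessian-guided, group-wise mixed-precision scheme.

```python
# Illustrative sketch only: symmetric uniform quantization of a weight matrix
# to signed 8-bit integers plus one float scale. This is a generic baseline,
# not the method proposed in the Q-BERT paper.
import numpy as np

def quantize_symmetric(weights: np.ndarray, num_bits: int = 8):
    """Map float weights to signed integers (num_bits <= 8 for the int8 container)."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    scale = float(np.max(np.abs(weights))) / qmax   # one scale per tensor (assumption)
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

# Usage: a float32 matrix stored as int8 uses roughly 4x less memory.
w = np.random.randn(768, 768).astype(np.float32)
q, s = quantize_symmetric(w, num_bits=8)
w_hat = dequantize(q, s)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Stored as int8 with one float scale, the matrix above occupies about a quarter of the memory of its float32 original, which is the footprint reduction the citing works refer to; the speed-up additionally depends on hardware with efficient integer kernels.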