Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.1

Fully Quantized Transformer for Machine Translation

Abstract: State-of-the-art neural machine translation methods employ massive amounts of parameters. Drastically reducing computational costs of such methods without affecting performance has been up to this point unsuccessful. To this end, we propose FullyQT: an all-inclusive quantization strategy for the Transformer. To the best of our knowledge, we are the first to show that it is possible to avoid any loss in translation quality with a fully quantized Transformer. Indeed, compared to full-precision, our 8-bit models sc…
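As a generic illustration of the 8-bit setting the abstract refers to, the short Python sketch below applies simulated ("fake") uniform quantization to a tensor: values are rounded to 256 levels and mapped back to floats, the usual building block of quantization-aware training. The function name and the min/max range calibration are assumptions for this example, not the exact FullyQT scheme.

import numpy as np

def fake_quant_8bit(x: np.ndarray) -> np.ndarray:
    # Quantize x to 256 uniform levels over its observed range, then dequantize.
    # Illustrative only: FullyQT's actual choice of quantized tensors and
    # calibration may differ.
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0
    if scale == 0.0:  # constant tensor, nothing to quantize
        return x.copy()
    codes = np.clip(np.round((x - x_min) / scale), 0, 255)  # integer codes in [0, 255]
    return codes * scale + x_min                            # back to float for training

# Example: the quantization error stays small for a well-scaled tensor.
w = np.random.randn(4, 4).astype(np.float32)
print("max abs error:", float(np.abs(w - fake_quant_8bit(w)).max()))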

Cited by 47 publications (32 citation statements)
References 48 publications

“…On Transformer-based models, 8-bit fixed-point quantization is successfully applied in fully-quantized Transformer (Prato et al., 2019) and Q8BERT (Zafrir et al., 2019). The use of lower bits is also investigated in (Shen et al., 2020; Fan et al., 2020; Zadeh and Moshovos, 2020).…”
Section: Quantization
confidence: 99%
“…In addition to weight quantization, further quantizing activations can speed up inference with target hardware by turning floating-point operations into integer or bit operations. In (Prato et al., 2019; Zafrir et al., 2019), 8-bit quantization is successfully applied to Transformer-based models with comparable performance as the full-precision baseline. However, quantizing these models to ultra low bits (e.g., 1 or 2 bits) can be much more challenging due to significant reduction in model capacity.…”
Section: Introduction
confidence: 99%
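To make the point about integer operations concrete, here is a minimal Python sketch of a matrix multiply in which both weights and activations are mapped to int8 with per-tensor symmetric scales, so the multiply-accumulates run in integer arithmetic and only a single rescale is done in floating point. The helper names and the symmetric per-tensor scheme are assumptions for illustration, not the exact pipeline of the cited works.

import numpy as np

def quantize_symmetric_int8(x: np.ndarray):
    # Map x to int8 codes with a per-tensor scale such that codes * scale ~= x.
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale

def int8_matmul(a: np.ndarray, w: np.ndarray) -> np.ndarray:
    a_q, a_scale = quantize_symmetric_int8(a)
    w_q, w_scale = quantize_symmetric_int8(w)
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)   # integer multiply-accumulate
    return acc.astype(np.float32) * (a_scale * w_scale)  # one floating-point rescale

a = np.random.randn(2, 8).astype(np.float32)
w = np.random.randn(8, 4).astype(np.float32)
print("max abs error vs float matmul:", float(np.abs(a @ w - int8_matmul(a, w)).max()))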
“…Quantization is often used to compress transformer models for higher computational and memory efficiency. Recently Prato et al. (2020) showed that for machine translation, attention values in transformers can be quantized with only a small impact on accuracy. While their results suggest that full precision attention values may not be necessary for high accuracy, it is unclear if one can retain the accuracy in inference-time quantization in general settings, i.e., without retraining.…”
Section: Discussion
confidence: 99%
“…Uniform quantization for Transformer is explored within reasonable degradation in BLEU score at INT8, while BLEU score can be severely damaged at low bit-precision such as INT4 (Prato et al., 2019). In order to exploit efficient integer arithmetic units with uniformly quantized models, activations need to be quantized as well (Jacob et al., 2018).…”
Section: Related Work
confidence: 99%
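For reference, a common uniform (affine) quantization rule in the spirit of Jacob et al. (2018) maps a real value x with scale s and zero-point z to a b-bit integer; the exact variants used by the quoted works may differ, so the formula below is only a generic illustration:

q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right) + z,\; 0,\; 2^{b}-1\right), \qquad \hat{x} = s\,(q - z)

With b = 8 this gives 2^8 = 256 representable levels, while b = 4 leaves only 2^4 = 16, which is consistent with the much larger BLEU degradation the quote reports at INT4.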
“…While uniform quantization may be effective for memory footprint savings, it would face various issues to improve inference time and to maintain reasonable BLEU score. For example, even integer arithmetic units for inference operations present limited speed up (Bhandare et al., 2019) and resulting BLEU score of quantized Transformer can be substantially degraded with low-bit quantization such as INT4 (Prato et al., 2019).…”
Section: Introduction
confidence: 99%