Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.1

Fully Quantized Transformer for Machine Translation

Abstract: State-of-the-art neural machine translation methods employ massive amounts of parameters. Drastically reducing computational costs of such methods without affecting performance has been up to this point unsuccessful. To this end, we propose FullyQT: an all-inclusive quantization strategy for the Transformer. To the best of our knowledge, we are the first to show that it is possible to avoid any loss in translation quality with a fully quantized Transformer. Indeed, compared to full-precision, our 8-bit models sc…
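As a generic illustration of the 8-bit setting the abstract refers to, the short Python sketch below applies simulated ("fake") uniform quantization to a tensor: values are rounded to 256 levels and mapped back to floats, the usual building block of quantization-aware training. The function name and the min/max range calibration are assumptions for this example, not the exact FullyQT scheme.

import numpy as np

def fake_quant_8bit(x: np.ndarray) -> np.ndarray:
    # Quantize x to 256 uniform levels over its observed range, then dequantize.
    # Illustrative only: FullyQT's actual choice of quantized tensors and
    # calibration may differ.
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0
    if scale == 0.0:  # constant tensor, nothing to quantize
        return x.copy()
    codes = np.clip(np.round((x - x_min) / scale), 0, 255)  # integer codes in [0, 255]
    return codes * scale + x_min                            # back to float for training

# Example: the quantization error stays small for a well-scaled tensor.
w = np.random.randn(4, 4).astype(np.float32)
print("max abs error:", float(np.abs(w - fake_quant_8bit(w)).max()))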

Cited by 47 publications (32 citation statements)
References 48 publications

“…On Transformer-based models, 8-bit fixed-point quantization is successfully applied in fully-quantized Transformer (Prato et al., 2019) and Q8BERT (Zafrir et al., 2019). The use of lower bits is also investigated in (Shen et al., 2020; Fan et al., 2020; Zadeh and Moshovos, 2020).…”
Section: Quantization
confidence: 99%
“…In addition to weight quantization, further quantizing activations can speed up inference with target hardware by turning floating-point operations into integer or bit operations. In (Prato et al., 2019; Zafrir et al., 2019), 8-bit quantization is successfully applied to Transformer-based models with comparable performance as the full-precision baseline. However, quantizing these models to ultra low bits (e.g., 1 or 2 bits) can be much more challenging due to significant reduction in model capacity.…”
Section: Introduction
confidence: 99%
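To make the point about integer operations concrete, here is a minimal Python sketch of a matrix multiply in which both weights and activations are mapped to int8 with per-tensor symmetric scales, so the multiply-accumulates run in integer arithmetic and only a single rescale is done in floating point. The helper names and the symmetric per-tensor scheme are assumptions for illustration, not the exact pipeline of the cited works.

import numpy as np

def quantize_symmetric_int8(x: np.ndarray):
    # Map x to int8 codes with a per-tensor scale such that codes * scale ~= x.
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale

def int8_matmul(a: np.ndarray, w: np.ndarray) -> np.ndarray:
    a_q, a_scale = quantize_symmetric_int8(a)
    w_q, w_scale = quantize_symmetric_int8(w)
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)   # integer multiply-accumulate
    return acc.astype(np.float32) * (a_scale * w_scale)  # one floating-point rescale

a = np.random.randn(2, 8).astype(np.float32)
w = np.random.randn(8, 4).astype(np.float32)
print("max abs error vs float matmul:", float(np.abs(a @ w - int8_matmul(a, w)).max()))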
“…Quantization is often used to compress transformer models for higher computational and memory efficiency. Recently Prato et al. (2020) showed that for machine translation, attention values in transformers can be quantized with only a small impact on accuracy. While their results suggest that full precision attention values may not be necessary for high accuracy, it is unclear if one can retain the accuracy in inference-time quantization in general settings, i.e., without retraining.…”
Section: Discussion
confidence: 99%
“…Uniform quantization for Transformer is explored within reasonable degradation in BLEU score at INT8, while BLEU score can be severely damaged at low bit-precision such as INT4 (Prato et al., 2019). In order to exploit efficient integer arithmetic units with uniformly quantized models, activations need to be quantized as well (Jacob et al., 2018).…”
Section: Related Work
confidence: 99%
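For reference, a common uniform (affine) quantization rule in the spirit of Jacob et al. (2018) maps a real value x with scale s and zero-point z to a b-bit integer; the exact variants used by the quoted works may differ, so the formula below is only a generic illustration:

q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right) + z,\; 0,\; 2^{b}-1\right), \qquad \hat{x} = s\,(q - z)

With b = 8 this gives 2^8 = 256 representable levels, while b = 4 leaves only 2^4 = 16, which is consistent with the much larger BLEU degradation the quote reports at INT4.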
“…While uniform quantization may be effective for memory footprint savings, it would face various issues to improve inference time and to maintain reasonable BLEU score. For example, even integer arithmetic units for inference operations present limited speed up (Bhandare et al., 2019) and resulting BLEU score of quantized Transformer can be substantially degraded with low-bit quantization such as INT4 (Prato et al., 2019).…”
Section: Introduction
confidence: 99%