Earlier approaches to machine translation, such as statistical machine translation (SMT), rule-based machine translation (RBMT), hybrid machine translation (HMT), and earlier neural machine translation (NMT), have reached a performance bottleneck. The newer Transformer-based machine translation model has become the preferred choice for English language translation. For instance, Google's BERT model organizes the Transformer module into bidirectional encoder representations. It is aware of users' search intentions as well as the material that the search engine has indexed. Unlike RankBrain, it does not need to evaluate previous searches to understand what people mean. BERT is designed to interpret words, sentences, and whole passages in context, much as a human reader does. It achieves remarkable improvements over other state-of-the-art benchmarks, which demonstrates the great potential of the Transformer model. However, Transformer-based translation models improve performance mainly at the cost of growing model size and complexity, usually requiring parameters on the scale of millions. Traditional computing systems struggle to cope with the growing memory and computation requirements. Although the latest high-end computers can run these models without noticeable lag, the biggest challenge in applying the Transformer model is deploying it efficiently on real-time or embedded devices. In this work, we propose a quantization scheme that reduces the parameter and computation complexity, which is of great importance for promoting the use of the Transformer model. Our experimental results show that the original Transformer model in 32-bit floating-point can be quantized to only 8 to 12 bits with negligible loss of translation quality. Moreover, owing to the accurate transformation of the block part, this small quality loss can easily be managed by users. Meanwhile, our algorithm achieves
a 2.6× to 4.0× compression ratio, which helps reduce the computation and energy required during the inference phase.
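
To make the relation between bit width and compression ratio concrete, the following is a minimal sketch of symmetric uniform quantization in plain Python. It is an illustration only and not the quantization scheme proposed in this work: the function names (`quantize`, `dequantize`, `compression_ratio`), the single per-tensor scale, and the Gaussian test weights are all assumptions made for this example.

```python
import random


def quantize(weights, bits):
    """Symmetric uniform quantization of floats to signed `bits`-bit integer codes.

    Assumption for illustration: one scale per tensor, chosen from the
    largest absolute weight. This is a generic textbook scheme.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def compression_ratio(bits, float_bits=32):
    """Idealized storage ratio; ignores overhead such as stored scales."""
    return float_bits / bits


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical weights standing in for one layer of a model.
    weights = [rng.gauss(0.0, 0.1) for _ in range(1000)]
    for bits in (8, 12):
        codes, scale = quantize(weights, bits)
        restored = dequantize(codes, scale)
        max_err = max(abs(w - r) for w, r in zip(weights, restored))
        print(f"{bits:2d} bits: ratio {compression_ratio(bits):.2f}x, "
              f"max abs error {max_err:.6f}")
```

Under these assumptions, 8-bit storage gives an idealized ratio of 32/8 = 4.0 and 12-bit storage gives 32/12 ≈ 2.67, which is close to the reported 2.6× to 4.0× range. The rounding error per weight is bounded by half the quantization step, so it shrinks as the bit width grows.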