Transformer base (Vaswani et al., 2017)†        27.30           -
Transformer base (Vaswani et al., 2017)         28.10           25.36
+ Freq-Exponential (Gu et al., 2020)            28.43 (+0.33)   24.99 (-0.37)
+ Freq-Chi-Square (Gu et al., 2020)             28.47 (+0.37)   25.43 (+0.07)
+ BMI-adaptive (Xu et al., 2021)                28.56 (+0.45)   25.77 (+0.41)
+ Focal Loss (Lin et al., 2017)                 28.43 (+0.33)   25.37 (+0.01)
+ Anti-Focal Loss (Raunak et al., 2020)         28.65 (+0.55)   25.50 (+0.14)
+ Self-Paced Learning (Wan et al., 2020)        28.69 (+0.59)   25.75 (+0.39)
+ Simple Fusion (Stahlberg et al., 2018)        27.82 (-0.28)   23.91 (-1.45)
+ LM Prior (Baziotis et al., 2020)              28 […]

Transformer base (Vaswani et al., 2017)         29.31           25.48
+ Freq-Exponential (Gu et al., 2020)            29.66 (+0.35)   25.57 (+0.09)
+ Freq-Chi-Square (Gu et al., 2020)             29.64 (+0.33)   25.64 (+0.14)
+ BMI-adaptive (Xu et al., 2021)                29.69 (+0.38)   25.81 (+0.33)
+ Focal Loss (Lin et al., 2017)                 29.65 (+0.34)   25.54 (+0.06)
+ Anti-Focal Loss (Raunak et al., 2020)         29.72 (+0.41)   25.64 (+0.16)
+ Self-Paced Learning (Wan et al., 2020)        29 […]

[…] (9) and (12). We fix the scale s to 0.3 and tune the scale t in a similar way.
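To make one of the reweighting baselines in the table concrete, below is a minimal sketch of the per-token focal loss of Lin et al. (2017), which down-weights tokens the model already predicts confidently. The function name, the gamma default, and the probabilities are illustrative, not taken from this paper.

```python
import math

def focal_loss(p_correct, gamma=2.0):
    """Focal loss (Lin et al., 2017) for a single token.

    p_correct: model probability assigned to the gold token.
    gamma: focusing parameter; gamma = 0 recovers plain cross-entropy.
    """
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)

# A confidently predicted token is down-weighted relative to cross-entropy,
# shifting training signal toward harder (e.g. low-frequency) tokens.
ce = -math.log(0.9)      # ordinary cross-entropy at p = 0.9
fl = focal_loss(0.9)     # focal loss at p = 0.9, gamma = 2
assert fl < ce
assert abs(focal_loss(0.9, gamma=0.0) - ce) < 1e-12
```

The anti-focal variant in the table (Raunak et al., 2020) inverts this idea, arguing that in generation the easy, high-frequency tokens should not be suppressed as aggressively.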