Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1573

Hint-Based Training for Non-Autoregressive Machine Translation

Li et al.

Abstract: Due to the unparallelizable nature of the autoregressive factorization, AutoRegressive Translation (ART) models have to generate tokens sequentially during decoding and thus suffer from high inference latency. Non-AutoRegressive Translation (NART) models were proposed to reduce the inference time, but could only achieve inferior translation accuracy. In this paper, we proposed a novel approach to leveraging the hints from hidden states and word alignments to help the training of NART models. The results achiev…

Cited by 58 publications (55 citation statements). References 21 publications.

“…We modify the attention mask so that it does not mask out the future tokens, and every token is dependent on both its preceding and succeeding tokens in every layer. (By 'greedy', we mean decoding with a beam width of 1.) Gu et al. (2017), Lee et al. (2018) and Li et al. (2019) use an additional positional self-attention module in each of the decoder layers, but we do not apply such a layer. It did not provide a clear performance improvement in our experiments, and we wanted to reduce the number of deviations from the base transformer structure.…”
Section: Model Structure
confidence: 99%
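As a side note on the attention-mask change described in the excerpt above, the following is a minimal, illustrative sketch (not code from the cited papers) contrasting a standard causal decoder mask with a non-autoregressive mask that leaves future positions visible and masks only padding. All function and variable names are assumptions made for the example.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Autoregressive decoding: position i may attend only to positions j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def non_autoregressive_mask(pad_mask: torch.Tensor) -> torch.Tensor:
    # Non-autoregressive decoding: no future masking, so every token attends to
    # all non-padding tokens, both preceding and succeeding.
    # pad_mask: (batch, seq_len), True at real tokens, False at padding.
    batch, seq_len = pad_mask.shape
    return pad_mask[:, None, :].expand(batch, seq_len, seq_len)
```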
“…We use a simple method to select the target length for NAR generation at test time (Li et al., 2019), where we set the target length to be T′ = T + C, where C is a constant term estimated from the parallel data and T is the length of the source sentence. We then create a list of candidate target lengths ranging from [T′ − B, T′ + B], where B is the half-width of the interval.…”
Section: Length Prediction
confidence: 99%
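A minimal sketch of the length-selection heuristic quoted above, under the assumption that C is the average target-minus-source length estimated offline from the parallel training data; the function name and the re-ranking step mentioned in the comments are illustrative, not taken from the cited work.

```python
def candidate_target_lengths(source_len: int, C: int, B: int) -> list[int]:
    # Predicted target length T' = T + C, where T is the source length and C is
    # a constant estimated from the parallel data.
    center = source_len + C
    # Candidate lengths in [T' - B, T' + B]; each candidate is decoded in
    # parallel and the highest-scoring output is typically kept.
    return [length for length in range(center - B, center + B + 1) if length > 0]
```

For example, with source_len=20, C=2 and B=4, the candidates are the 2B + 1 = 9 lengths from 18 through 26.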
“…Non-autoregressive (NAR) models (Oord et al., 2017; Gu et al., 2017), which generate all the tokens in a target sequence in parallel and can speed up inference, are widely explored in natural language and speech processing tasks such as neural machine translation (NMT) (Gu et al., 2017; Guo et al., 2019a; Li et al., 2019b; Guo et al., 2019b), automatic speech recognition (ASR) and text to speech (TTS) synthesis (Oord et al., 2017). However, NAR models usually lead to lower accuracy than their autoregressive (AR) counterparts since the inner dependencies among the target tokens are explicitly removed.…”
Section: Introduction
confidence: 99%
“…Several techniques have been proposed to alleviate the accuracy degradation, including 1) knowledge distillation (Oord et al., 2017; Gu et al., 2017; Guo et al., 2019a,b), 2) imposing a source-target alignment constraint with fertility (Gu et al., 2017), word mapping (Guo et al., 2019a), attention distillation (Li et al., 2019b) and duration prediction. With the help of those techniques, it is observed that NAR models can match the accuracy of AR models for some tasks, but the gap still exists for some other tasks (Gu et al., 2017).…”
Section: Introduction
confidence: 99%
“…Under the regularization of scenario knowledge, the student is effectively guided towards a wider local minimum that represents better generalization performance (Chaudhari et al., 2017; Keskar et al., 2017). To facilitate knowledge transfer, the student mimics the teacher on every layer instead of just the top layer, which alleviates the delayed supervised signal problem using hierarchical semantic information in the teacher (Li et al., 2019a). Besides containing the information of future conversations, the distilled knowledge (Hinton et al., 2015) is also a less noisy and more "deterministic" supervised signal in comparison to real-world responses (Lee et al., 2018; Guo et al., 2019), which provides the student with smoother sequence trajectories that are easier to fit.…”
Section: Introduction
confidence: 99%
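For readers unfamiliar with the layer-wise (hint-based) distillation referenced above, here is a hedged sketch of a per-layer hidden-state matching loss added to the usual training objective; it is a generic illustration under assumed tensor shapes, not the exact formulation used in any of the cited papers.

```python
import torch
import torch.nn.functional as F

def layerwise_hint_loss(student_states, teacher_states):
    # student_states / teacher_states: lists of (batch, seq_len, hidden) tensors,
    # one per decoder layer, assumed to be aligned layer by layer.
    loss = torch.zeros(())
    for s, t in zip(student_states, teacher_states):
        # Match every layer's hidden states, not just the top layer,
        # so supervision reaches lower layers without delay.
        loss = loss + F.mse_loss(s, t.detach())
    return loss / len(student_states)
```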