Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1573

Hint-Based Training for Non-Autoregressive Machine Translation

Li et al.

Abstract: Due to the unparallelizable nature of the autoregressive factorization, AutoRegressive Translation (ART) models have to generate tokens sequentially during decoding and thus suffer from high inference latency. Non-AutoRegressive Translation (NART) models were proposed to reduce the inference time, but could only achieve inferior translation accuracy. In this paper, we proposed a novel approach to leveraging the hints from hidden states and word alignments to help the training of NART models. The results achiev…

Cited by 58 publications (55 citation statements). References 21 publications.

“…We modify the attention mask so that it does not mask out the future tokens, and every token is dependent on both its preceding and succeeding tokens in every layer. (By 'greedy', we mean decoding with a beam width of 1.) Gu et al. (2017), Lee et al. (2018) and Li et al. (2019) use an additional positional self-attention module in each of the decoder layers, but we do not apply such a layer. It did not provide a clear performance improvement in our experiments, and we wanted to reduce the number of deviations from the base transformer structure.…”
Section: Model Structure
confidence: 99%
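As a side note on the attention-mask change described in the excerpt above, the following is a minimal, illustrative sketch (not code from the cited papers) contrasting a standard causal decoder mask with a non-autoregressive mask that leaves future positions visible and masks only padding. All function and variable names are assumptions made for the example.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Autoregressive decoding: position i may attend only to positions j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def non_autoregressive_mask(pad_mask: torch.Tensor) -> torch.Tensor:
    # Non-autoregressive decoding: no future masking, so every token attends to
    # all non-padding tokens, both preceding and succeeding.
    # pad_mask: (batch, seq_len), True at real tokens, False at padding.
    batch, seq_len = pad_mask.shape
    return pad_mask[:, None, :].expand(batch, seq_len, seq_len)
```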
“…We use a simple method to select the target length for NAR generation at test time (Li et al., 2019), where we set the target length to be T′ = T + C, where C is a constant term estimated from the parallel data and T is the length of the source sentence. We then create a list of candidate target lengths ranging from [T′ − B, T′ + B], where B is the half-width of the interval.…”
Section: Length Prediction
confidence: 99%
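A minimal sketch of the length-selection heuristic quoted above, under the assumption that C is the average target-minus-source length estimated offline from the parallel training data; the function name and the re-ranking step mentioned in the comments are illustrative, not taken from the cited work.

```python
def candidate_target_lengths(source_len: int, C: int, B: int) -> list[int]:
    # Predicted target length T' = T + C, where T is the source length and C is
    # a constant estimated from the parallel data.
    center = source_len + C
    # Candidate lengths in [T' - B, T' + B]; each candidate is decoded in
    # parallel and the highest-scoring output is typically kept.
    return [length for length in range(center - B, center + B + 1) if length > 0]
```

For example, with source_len=20, C=2 and B=4, the candidates are the 2B + 1 = 9 lengths from 18 through 26.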
“…Non-autoregressive (NAR) models (Oord et al., 2017; Gu et al., 2017), which generate all the tokens in a target sequence in parallel and can speed up inference, are widely explored in natural language and speech processing tasks such as neural machine translation (NMT) (Gu et al., 2017; Guo et al., 2019a; Li et al., 2019b; Guo et al., 2019b), automatic speech recognition (ASR) and text to speech (TTS) synthesis (Oord et al., 2017). However, NAR models usually lead to lower accuracy than their autoregressive (AR) counterparts since the inner dependencies among the target tokens are explicitly removed.…”
Section: Introduction
confidence: 99%
“…Several techniques have been proposed to alleviate the accuracy degradation, including 1) knowledge distillation (Oord et al., 2017; Gu et al., 2017; Guo et al., 2019a,b), 2) imposing a source-target alignment constraint with fertility (Gu et al., 2017), word mapping (Guo et al., 2019a), attention distillation (Li et al., 2019b) and duration prediction. With the help of those techniques, it is observed that NAR models can match the accuracy of AR models for some tasks, but the gap still exists for some other tasks (Gu et al., 2017).…”
Section: Introduction
confidence: 99%
“…Under the regularization of scenario knowledge, the student is effectively guided towards a wider local minimum that represents better generalization performance (Chaudhari et al., 2017; Keskar et al., 2017). To facilitate knowledge transfer, the student mimics the teacher on every layer instead of just the top layer, which alleviates the delayed supervised signal problem using hierarchical semantic information in the teacher (Li et al., 2019a). Besides containing the information of future conversations, the distilled knowledge (Hinton et al., 2015) is also a less noisy and more "deterministic" supervised signal in comparison to real-world responses (Lee et al., 2018; Guo et al., 2019), which provides the student with smoother sequence trajectories that are easier to fit.…”
Section: Introduction
confidence: 99%
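For readers unfamiliar with the layer-wise (hint-based) distillation referenced above, here is a hedged sketch of a per-layer hidden-state matching loss added to the usual training objective; it is a generic illustration under assumed tensor shapes, not the exact formulation used in any of the cited papers.

```python
import torch
import torch.nn.functional as F

def layerwise_hint_loss(student_states, teacher_states):
    # student_states / teacher_states: lists of (batch, seq_len, hidden) tensors,
    # one per decoder layer, assumed to be aligned layer by layer.
    loss = torch.zeros(())
    for s, t in zip(student_states, teacher_states):
        # Match every layer's hidden states, not just the top layer,
        # so supervision reaches lower layers without delay.
        loss = loss + F.mse_loss(s, t.detach())
    return loss / len(student_states)
```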