Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1633
Mask-Predict: Parallel Decoding of Conditional Masked Language Models

Abstract: Most machine translation systems generate text autoregressively from left to right. We, instead, use a masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation. This approach allows for efficient iterative decoding, where we first predict all of the target words non-autoregressively, and then repeatedly mask out and regenerate the subset of words that the model is least confident about. By applyin…
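
The iterative decoding loop described in the abstract can be sketched as follows. This is a minimal PyTorch illustration, not the authors' released implementation: `model(src, tgt)` is a hypothetical interface assumed to return per-position log-probabilities for the whole target in one parallel pass, and the linear re-masking schedule follows the description in the paper.

```python
import torch

def mask_predict(model, src, tgt_len, mask_id, iterations=10):
    """Sketch of mask-predict decoding: predict all tokens, then
    repeatedly re-mask and re-predict the least confident ones."""
    # Iteration 0: start from a fully masked target and predict every
    # position non-autoregressively in a single parallel pass.
    tgt = torch.full((tgt_len,), mask_id, dtype=torch.long)
    log_probs = model(src, tgt)              # shape: (tgt_len, vocab_size)
    scores, tgt = log_probs.max(dim=-1)      # per-position confidence and token id

    for t in range(1, iterations):
        # Linearly decay the number of re-masked tokens over the iterations.
        n_mask = int(tgt_len * (iterations - t) / iterations)
        if n_mask == 0:
            break
        # Mask out the positions the model is least confident about ...
        worst = scores.topk(n_mask, largest=False).indices
        tgt[worst] = mask_id
        # ... and re-predict them, conditioned on the kept (high-confidence) tokens.
        log_probs = model(src, tgt)
        new_scores, new_tokens = log_probs.max(dim=-1)
        tgt[worst] = new_tokens[worst]
        scores[worst] = new_scores[worst]
    return tgt
```

With `iterations=1` the loop reduces to purely non-autoregressive, single-pass decoding; the repeated mask-and-repredict steps are what close most of the quality gap to left-to-right models.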

Cited by 389 publications (638 citation statements). References 15 publications.
“…Iterative refinement. We found it useful to iteratively refine predictions during inference, similar to techniques from non-autoregressive machine translation (Ghazvininejad et al., 2019). We start with a wordpiece-tokenized input, e.g.…”
Section: Finetuning (We finetune E-BERT-MLM on the training set to min…)
Mentioning; confidence: 99%
“…Non-autoregressive neural machine translation (Gu et al., 2018) aims to enable the parallel generation of output tokens without sacrificing translation quality. There has been a surge of recent interest in this family of efficient decoding models, resulting in the development of iterative refinement (Lee et al., 2018), CTC models (Libovicky and Helcl, 2018), insertion-based methods (Chan et al., 2019b), edit-based methods (Gu et al., 2019; Ruis et al., 2019), masked language models (Ghazvininejad et al., 2019, 2020b), and normalizing flow models (Ma et al., 2019). Some of these methods generate the output tokens in a constant number of steps (Gu et al., 2018; Libovicky and Helcl, 2018; Lee et al., 2018; Ghazvininejad et al., 2019, 2020b), while others require a logarithmic number of generation steps (Chan et al., 2019a,b; Li and Chan, 2019).…”
Section: Introduction
Mentioning; confidence: 99%
“…However, we found that directly optimizing prompts guided by gradients was unstable and often yielded prompts in unnatural English in our preliminary experiments. Thus, we instead resorted to a more straightforward hill-climbing method that starts with an initial prompt, then masks out one token at a time and replaces it with the most probable token conditioned on the other tokens, inspired by the mask-predict decoding algorithm used in non-autoregressive machine translation (Ghazvininejad et al., 2019).…”
Section: LM-aware Prompt Generation
Mentioning; confidence: 99%
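
The mask-one-token-at-a-time hill climbing described in the snippet above can be illustrated with a short sketch using Hugging Face's `fill-mask` pipeline. The model choice, example prompt, pass count, and the `refine_prompt` helper are all assumptions for illustration; they are not taken from the cited paper.

```python
from transformers import pipeline

# Illustrative only: model and prompt are placeholder assumptions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def refine_prompt(tokens, n_passes=3):
    """Greedy mask-and-replace hill climbing over a whitespace-tokenized prompt."""
    for _ in range(n_passes):
        changed = False
        for i in range(len(tokens)):
            # Mask one position and take the MLM's most probable filler,
            # conditioned on all of the other tokens.
            masked = tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:]
            best = fill_mask(" ".join(masked), top_k=1)[0]["token_str"].strip()
            if best != tokens[i]:
                tokens[i] = best
                changed = True
        if not changed:  # no single-token replacement changes the prompt: local optimum
            break
    return tokens

print(" ".join(refine_prompt("the capital of france is paris".split())))
```

A real prompt-search procedure would typically score each candidate replacement against the downstream objective before accepting it, but the loop above captures the mask-one, re-predict, replace structure the snippet attributes to mask-predict.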