Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1633
Mask-Predict: Parallel Decoding of Conditional Masked Language Models

Abstract: Most machine translation systems generate text autoregressively from left to right. We, instead, use a masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation. This approach allows for efficient iterative decoding, where we first predict all of the target words non-autoregressively, and then repeatedly mask out and regenerate the subset of words that the model is least confident about. By applyin…
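
The iterative decoding loop described in the abstract can be sketched as follows. This is a minimal PyTorch illustration, not the authors' released implementation: `model(src, tgt)` is a hypothetical interface assumed to return per-position log-probabilities for the whole target in one parallel pass, and the linear re-masking schedule follows the description in the paper.

```python
import torch

def mask_predict(model, src, tgt_len, mask_id, iterations=10):
    """Sketch of mask-predict decoding: predict all tokens, then
    repeatedly re-mask and re-predict the least confident ones."""
    # Iteration 0: start from a fully masked target and predict every
    # position non-autoregressively in a single parallel pass.
    tgt = torch.full((tgt_len,), mask_id, dtype=torch.long)
    log_probs = model(src, tgt)              # shape: (tgt_len, vocab_size)
    scores, tgt = log_probs.max(dim=-1)      # per-position confidence and token id

    for t in range(1, iterations):
        # Linearly decay the number of re-masked tokens over the iterations.
        n_mask = int(tgt_len * (iterations - t) / iterations)
        if n_mask == 0:
            break
        # Mask out the positions the model is least confident about ...
        worst = scores.topk(n_mask, largest=False).indices
        tgt[worst] = mask_id
        # ... and re-predict them, conditioned on the kept (high-confidence) tokens.
        log_probs = model(src, tgt)
        new_scores, new_tokens = log_probs.max(dim=-1)
        tgt[worst] = new_tokens[worst]
        scores[worst] = new_scores[worst]
    return tgt
```

With `iterations=1` the loop reduces to purely non-autoregressive, single-pass decoding; the repeated mask-and-repredict steps are what close most of the quality gap to left-to-right models.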

Cited by 389 publications (638 citation statements). References 15 publications.
“…Iterative refinement. We found it useful to iteratively refine predictions during inference, similar to techniques from non-autoregressive machine translation (Ghazvininejad et al., 2019). We start with a wordpiece-tokenized input, e.g.…”
Section: Finetuning (We finetune E-BERT-MLM on the training set to min…)
Mentioning; confidence: 99%
“…Non-autoregressive neural machine translation (Gu et al., 2018) aims to enable the parallel generation of output tokens without sacrificing translation quality. There has been a surge of recent interest in this family of efficient decoding models, resulting in the development of iterative refinement (Lee et al., 2018), CTC models (Libovicky and Helcl, 2018), insertion-based methods (Chan et al., 2019b), edit-based methods (Gu et al., 2019; Ruis et al., 2019), masked language models (Ghazvininejad et al., 2019, 2020b), and normalizing flow models (Ma et al., 2019). Some of these methods generate the output tokens in a constant number of steps (Gu et al., 2018; Libovicky and Helcl, 2018; Lee et al., 2018; Ghazvininejad et al., 2019, 2020b), while others require a logarithmic number of generation steps (Chan et al., 2019a,b; Li and Chan, 2019).…”
Section: Introduction
Mentioning; confidence: 99%
“…However, we found that directly optimizing prompts guided by gradients was unstable and often yielded prompts in unnatural English in our preliminary experiments. Thus, we instead resorted to a more straightforward hill-climbing method that starts with an initial prompt, then masks out one token at a time and replaces it with the most probable token conditioned on the other tokens, inspired by the mask-predict decoding algorithm used in non-autoregressive machine translation (Ghazvininejad et al., 2019).…”
Section: LM-aware Prompt Generation
Mentioning; confidence: 99%
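
The mask-one-token-at-a-time hill climbing described in the snippet above can be illustrated with a short sketch using Hugging Face's `fill-mask` pipeline. The model choice, example prompt, pass count, and the `refine_prompt` helper are all assumptions for illustration; they are not taken from the cited paper.

```python
from transformers import pipeline

# Illustrative only: model and prompt are placeholder assumptions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def refine_prompt(tokens, n_passes=3):
    """Greedy mask-and-replace hill climbing over a whitespace-tokenized prompt."""
    for _ in range(n_passes):
        changed = False
        for i in range(len(tokens)):
            # Mask one position and take the MLM's most probable filler,
            # conditioned on all of the other tokens.
            masked = tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:]
            best = fill_mask(" ".join(masked), top_k=1)[0]["token_str"].strip()
            if best != tokens[i]:
                tokens[i] = best
                changed = True
        if not changed:  # no single-token replacement changes the prompt: local optimum
            break
    return tokens

print(" ".join(refine_prompt("the capital of france is paris".split())))
```

A real prompt-search procedure would typically score each candidate replacement against the downstream objective before accepting it, but the loop above captures the mask-one, re-predict, replace structure the snippet attributes to mask-predict.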