2020
DOI: 10.1609/aaai.v34i05.6465

Acquiring Knowledge from Pre-Trained Model to Neural Machine Translation

Abstract: Pre-training and fine-tuning have achieved great success in the natural language processing field. The standard paradigm for exploiting them has two steps: first, pre-train a model, e.g. BERT, on large-scale unlabeled monolingual data; then, fine-tune the pre-trained model with labeled data from downstream tasks. However, in neural machine translation (NMT), we address the problem that the training objective of the bilingual task is far different from that of the monolingual pre-trained model. This gap leads tha…
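A minimal sketch of the two-step paradigm the abstract refers to, assuming a generic PyTorch encoder as a stand-in for BERT and a toy classification head for the labeled downstream task (module sizes and names are illustrative assumptions, and the weights are randomly initialized here purely for illustration):

```python
import torch
import torch.nn as nn

# Step 1 (assumed already done): an encoder pre-trained on large-scale
# unlabeled monolingual data, e.g. with a masked-language-model objective.
# In practice its weights would be loaded from a checkpoint such as BERT.
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)

# Step 2: fine-tune the pre-trained encoder together with a task head
# on labeled downstream data (here a toy sequence-classification head).
task_head = nn.Linear(256, 2)
optimizer = torch.optim.Adam(
    list(pretrained_encoder.parameters()) + list(task_head.parameters()),
    lr=1e-4,
)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randn(8, 16, 256)        # toy batch: 8 sentences, 16 tokens
labels = torch.randint(0, 2, (8,))      # toy downstream labels

hidden = pretrained_encoder(tokens)     # contextual representations
logits = task_head(hidden.mean(dim=1))  # pool over tokens, then classify
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```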

Cited by 54 publications (40 citation statements) · References 19 publications
“…Knowledge distillation was proposed to compress models (Buciluȃ et al., 2006) or ensembles of models (Rusu et al., 2016; Hinton et al., 2015; Sanh et al., 2019; Mukherjee and Hassan Awadallah, 2020) via transferring knowledge from one or more models (teacher models) to a smaller one (student model). Besides model compression, knowledge distillation has also been applied to various tasks, like cross-modal learning (Hu et al., 2020), machine translation (Weng et al., 2020), and automated machine learning (Kang et al., 2020).…”
Section: Knowledge Distillation (mentioning)
Confidence: 99%
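The teacher-student transfer this statement describes can be sketched with a generic distillation objective: a KL term on temperature-softened teacher logits mixed with the usual hard-label cross-entropy. The temperature, mixing weight, and toy linear models below are illustrative assumptions, not the cited papers' exact setups.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL term (knowledge transferred from the teacher)
    plus the usual hard-label cross-entropy, mixed by alpha."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy teacher (pre-trained, frozen) and smaller student classifiers.
teacher = nn.Linear(128, 10)
student = nn.Linear(128, 10)

x = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))

with torch.no_grad():                  # the teacher provides soft targets only
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, labels)
loss.backward()
```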
“…However, the combined encoder-decoder model, where the cross-attention is randomly initialized, often does not work well because of the lack of a semantic interface between the pre-trained encoder and decoder. There is also some work trying to leverage BERT-like pre-trained models for MT with an adapter (Guo et al., 2020) or an APT framework (Weng et al., 2020). The former defines additional layers in the pre-trained encoder and decoder during fine-tuning, while the latter adopts a fusion mechanism or knowledge distillation to leverage knowledge in BERT for MT.…”
Section: Related Work (mentioning)
Confidence: 99%
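A minimal sketch of the fusion idea mentioned here: an NMT encoder layer that additionally attends over the hidden states of a frozen BERT-like model and gates the two sources together. The class, names, and gating scheme are illustrative assumptions, not the exact adapter or APT architecture of the cited papers.

```python
import torch
import torch.nn as nn

class FusionEncoderLayer(nn.Module):
    """NMT encoder layer with an extra attention over pre-trained-LM states.

    Illustrative only: self-attention over the source, cross-attention over
    the (frozen) pre-trained model's hidden states, then a gated combination.
    """
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.lm_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src, lm_states):
        h, _ = self.self_attn(src, src, src)            # standard NMT self-attention
        k, _ = self.lm_attn(src, lm_states, lm_states)  # knowledge from the pre-trained model
        g = torch.sigmoid(self.gate(torch.cat([h, k], dim=-1)))
        return self.norm(src + g * h + (1 - g) * k)

layer = FusionEncoderLayer()
src = torch.randn(2, 10, 256)        # toy source-side representations
lm_states = torch.randn(2, 10, 256)  # toy frozen BERT-like hidden states
out = layer(src, lm_states)
```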
“…MLMs are generally used in downstream tasks by fine-tuning (Liu, 2019; Zhang et al., 2019); however, Zhu et al. (2020) demonstrated that it is more effective to provide the output of the final layer of an MLM to the EncDec model as contextual embeddings. Recently, Weng et al. (2019) addressed the mismatch problem between contextual knowledge from pre-trained models and the target bilingual machine translation task. Here, we also claim that addressing the gap between grammatically correct raw corpora and GEC corpora can lead to the improvement of GEC systems.…”
Section: Related Work (mentioning)
Confidence: 99%
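A minimal sketch of the alternative this statement describes, where the final-layer output of an MLM is handed to the encoder-decoder model as contextual embeddings. Here the projected MLM states simply serve as the decoder's memory, which is a simplification rather than the cited paper's exact mechanism, and all module sizes are assumptions.

```python
import torch
import torch.nn as nn

d_model = 256

# Stand-in for the final-layer hidden states of a frozen masked LM (e.g. BERT).
mlm_hidden = torch.randn(2, 12, 768)

# Project the MLM's contextual embeddings into the EncDec model's dimension
# and let a Transformer decoder attend over them as its memory.
project = nn.Linear(768, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

memory = project(mlm_hidden)          # contextual embeddings from the MLM
tgt = torch.randn(2, 7, d_model)      # toy target-side representations
out = decoder(tgt, memory)            # shape: (2, 7, 256)
```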