2020
DOI: 10.1609/aaai.v34i05.6465

Acquiring Knowledge from Pre-Trained Model to Neural Machine Translation

Abstract: Pre-training and fine-tuning have achieved great success in the natural language processing field. The standard paradigm for exploiting them has two steps: first, pre-train a model, e.g. BERT, on large-scale unlabeled monolingual data; then, fine-tune the pre-trained model with labeled data from downstream tasks. However, in neural machine translation (NMT), we address the problem that the training objective of the bilingual task is far different from that of the monolingual pre-trained model. This gap leads tha…
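A minimal sketch of the two-step paradigm the abstract refers to, assuming a generic PyTorch encoder as a stand-in for BERT and a toy classification head for the labeled downstream task (module sizes and names are illustrative assumptions, and the weights are randomly initialized here purely for illustration):

```python
import torch
import torch.nn as nn

# Step 1 (assumed already done): an encoder pre-trained on large-scale
# unlabeled monolingual data, e.g. with a masked-language-model objective.
# In practice its weights would be loaded from a checkpoint such as BERT.
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)

# Step 2: fine-tune the pre-trained encoder together with a task head
# on labeled downstream data (here a toy sequence-classification head).
task_head = nn.Linear(256, 2)
optimizer = torch.optim.Adam(
    list(pretrained_encoder.parameters()) + list(task_head.parameters()),
    lr=1e-4,
)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randn(8, 16, 256)        # toy batch: 8 sentences, 16 tokens
labels = torch.randint(0, 2, (8,))      # toy downstream labels

hidden = pretrained_encoder(tokens)     # contextual representations
logits = task_head(hidden.mean(dim=1))  # pool over tokens, then classify
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```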

Cited by 54 publications (40 citation statements) · References 19 publications
“…Knowledge distillation was proposed to compress models (Buciluȃ et al., 2006) or ensembles of models (Rusu et al., 2016; Hinton et al., 2015; Sanh et al., 2019; Mukherjee and Hassan Awadallah, 2020) via transferring knowledge from one or more models (teacher models) to a smaller one (student model). Besides model compression, knowledge distillation has also been applied to various tasks, like cross-modal learning (Hu et al., 2020), machine translation (Weng et al., 2020), and automated machine learning (Kang et al., 2020).…”
Section: Knowledge Distillation (mentioning)
Confidence: 99%
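The teacher-student transfer this statement describes can be sketched with a generic distillation objective: a KL term on temperature-softened teacher logits mixed with the usual hard-label cross-entropy. The temperature, mixing weight, and toy linear models below are illustrative assumptions, not the cited papers' exact setups.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL term (knowledge transferred from the teacher)
    plus the usual hard-label cross-entropy, mixed by alpha."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy teacher (pre-trained, frozen) and smaller student classifiers.
teacher = nn.Linear(128, 10)
student = nn.Linear(128, 10)

x = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))

with torch.no_grad():                  # the teacher provides soft targets only
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, labels)
loss.backward()
```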
“…However, the combined encoder-decoder model, where the cross-attention is randomly initialized, often does not work well because of the lack of a semantic interface between the pre-trained encoder and decoder. There is also some work trying to leverage BERT-like pre-trained models for MT with an adapter (Guo et al., 2020) or an APT framework (Weng et al., 2020). The former defines additional layers in the pre-trained encoder and decoder during fine-tuning, while the latter adopts a fusion mechanism or knowledge distillation to leverage knowledge in BERT for MT.…”
Section: Related Work (mentioning)
Confidence: 99%
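A minimal sketch of the fusion idea mentioned here: an NMT encoder layer that additionally attends over the hidden states of a frozen BERT-like model and gates the two sources together. The class, names, and gating scheme are illustrative assumptions, not the exact adapter or APT architecture of the cited papers.

```python
import torch
import torch.nn as nn

class FusionEncoderLayer(nn.Module):
    """NMT encoder layer with an extra attention over pre-trained-LM states.

    Illustrative only: self-attention over the source, cross-attention over
    the (frozen) pre-trained model's hidden states, then a gated combination.
    """
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.lm_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src, lm_states):
        h, _ = self.self_attn(src, src, src)            # standard NMT self-attention
        k, _ = self.lm_attn(src, lm_states, lm_states)  # knowledge from the pre-trained model
        g = torch.sigmoid(self.gate(torch.cat([h, k], dim=-1)))
        return self.norm(src + g * h + (1 - g) * k)

layer = FusionEncoderLayer()
src = torch.randn(2, 10, 256)        # toy source-side representations
lm_states = torch.randn(2, 10, 256)  # toy frozen BERT-like hidden states
out = layer(src, lm_states)
```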
“…MLMs are generally used in downstream tasks by fine-tuning (Liu, 2019; Zhang et al., 2019); however, Zhu et al. (2020) demonstrated that it is more effective to provide the output of the final layer of an MLM to the EncDec model as contextual embeddings. Recently, Weng et al. (2019) addressed the mismatch problem between contextual knowledge from pre-trained models and the target bilingual machine translation task. Here, we also claim that addressing the gap between grammatically correct raw corpora and GEC corpora can lead to the improvement of GEC systems.…”
Section: Related Work (mentioning)
Confidence: 99%
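A minimal sketch of the alternative this statement describes, where the final-layer output of an MLM is handed to the encoder-decoder model as contextual embeddings. Here the projected MLM states simply serve as the decoder's memory, which is a simplification rather than the cited paper's exact mechanism, and all module sizes are assumptions.

```python
import torch
import torch.nn as nn

d_model = 256

# Stand-in for the final-layer hidden states of a frozen masked LM (e.g. BERT).
mlm_hidden = torch.randn(2, 12, 768)

# Project the MLM's contextual embeddings into the EncDec model's dimension
# and let a Transformer decoder attend over them as its memory.
project = nn.Linear(768, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

memory = project(mlm_hidden)          # contextual embeddings from the MLM
tgt = torch.randn(2, 7, d_model)      # toy target-side representations
out = decoder(tgt, memory)            # shape: (2, 7, 256)
```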