Findings of the Association for Computational Linguistics: EMNLP 2021
DOI: 10.18653/v1/2021.findings-emnlp.233

Multilingual Translation via Grafting Pre-trained Language Models

Abstract: Can pre-trained BERT for one language and GPT for another be glued together to translate texts? Self-supervised training using only monolingual data has led to the success of pre-trained (masked) language models in many NLP tasks. However, directly connecting BERT as an encoder and GPT as a decoder can be challenging in machine translation, for GPT-like models lack a cross-attention component that is needed in seq2seq decoders. In this paper, we propose Graformer to graft separately pre-trained (masked) language…
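
As a rough illustration of the grafting idea described in the abstract, the sketch below wraps a pre-trained GPT-like decoder block with a newly initialized cross-attention sublayer that attends to the encoder output, and freezes everything except the new sublayer. This is a minimal PyTorch sketch under assumed interfaces, names, and dimensions, not the paper's actual Graformer implementation.

```python
import torch
import torch.nn as nn


class GraftedDecoderLayer(nn.Module):
    """One decoder block of a grafted seq2seq model (illustrative only).

    The self-attention/feed-forward sublayer comes from a pre-trained
    GPT-like decoder; the cross-attention sublayer is newly initialized,
    since decoder-only LMs are never trained with one.
    """

    def __init__(self, d_model: int, n_heads: int, pretrained_block: nn.Module):
        super().__init__()
        # Assumed interface: the pre-trained block maps hidden states of
        # shape (batch, seq, d_model) to hidden states of the same shape.
        self.pretrained_block = pretrained_block
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_ln = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        x = self.pretrained_block(x)                 # pre-trained self-attention + FFN
        attn_out, _ = self.cross_attn(self.cross_ln(x), encoder_out, encoder_out)
        return x + attn_out                          # residual around the new sublayer


def graft(encoder: nn.Module, decoder_blocks, d_model: int = 768, n_heads: int = 12):
    """Freeze the pre-trained encoder and decoder blocks; only the newly
    added cross-attention sublayers receive gradients."""
    layers = nn.ModuleList(
        GraftedDecoderLayer(d_model, n_heads, blk) for blk in decoder_blocks
    )
    for p in encoder.parameters():
        p.requires_grad = False
    for layer in layers:
        for p in layer.pretrained_block.parameters():
            p.requires_grad = False
    return encoder, layers
```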

Cited by 12 publications (12 citation statements)
References 42 publications
“…Similar to modern ST architectures (Gállego et al., 2021; …), we use a pretrained W2V2 large model as our encoder and a pretrained mBART50 decoder as our decoder. We randomly initialize the top 3 layers of W2V2 in experiments involving RedApt and find that this enables faster convergence, verifying earlier observations by Sun et al. (2021). We freeze the W2V2 feature extractor.…”
Section: Experimental Settings (supporting)
confidence: 72%
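
The encoder-side recipe in this excerpt (freeze the W2V2 convolutional feature extractor, randomly re-initialize the top 3 transformer layers) could be sketched with Hugging Face Transformers roughly as follows; the checkpoint name and the initialization scheme are assumptions, not the citing authors' exact setup.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

# Load the pretrained speech encoder (checkpoint name is an assumption).
w2v2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-lv60")

# Freeze the convolutional feature extractor so it is not updated during training.
for p in w2v2.feature_extractor.parameters():
    p.requires_grad = False

# Randomly re-initialize the top 3 transformer layers of the encoder
# (the init scheme below is illustrative).
def reinit(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

for layer in w2v2.encoder.layers[-3:]:
    layer.apply(reinit)
```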
“…Sun et al. [19] proposed a grafting approach in which they stack multiple sets of BERT encoder layers and autoregressive GPT decoder layers, and freeze selected encoder or decoder parameters during fine-tuning. This approach leads to strong gains in output quality; however, they do not compare model sizes, training time, or inference time directly with previous work.…”
Section: Related Work (mentioning)
confidence: 99%
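
The "freeze selected encoder or decoder parameters during fine-tuning" step amounts to toggling requires_grad on chosen parameter groups; a generic PyTorch sketch (the parameter-name pattern is an assumption and depends on how the concrete grafted model names its modules):

```python
import torch.nn as nn


def freeze_except(model: nn.Module, trainable_substrings=("cross_attn",)) -> int:
    """Freeze every parameter whose name does not contain one of the given
    substrings; return the number of parameters left trainable.

    With the default pattern, only newly added cross-attention sublayers
    keep receiving gradients while the pre-trained parts stay fixed.
    """
    n_trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)
        if param.requires_grad:
            n_trainable += param.numel()
    return n_trainable
```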
“…Pre-training a text decoder can be done independently (e.g., GPT-2 [26]) or jointly with an encoder for sequence-to-sequence tasks (e.g., mT5 [27], mBART [6]). With the former approach, the text decoder needs to be fused with additional encoders or adapters after pre-training [28], which increases architectural complexity. With the latter approach, the decoder component can usually be used as an individual module without architectural modifications.…”
Section: Pre-trained Models (mentioning)
confidence: 99%
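
To illustrate the latter point, the decoder of a jointly pre-trained seq2seq model can be pulled out as an individual module, cross-attention included; a small Hugging Face Transformers sketch (checkpoint name assumed):

```python
from transformers import MBartForConditionalGeneration

# A jointly pre-trained seq2seq model ships a decoder that already contains
# cross-attention, so it can be detached and reused as a standalone module.
mbart = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
decoder = mbart.get_decoder()  # MBartDecoder with self- and cross-attention

# A decoder-only LM such as GPT-2, by contrast, has no cross-attention,
# so fusing it with a new encoder requires extra adapters or sublayers.
```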