Findings of the Association for Computational Linguistics: EMNLP 2021
DOI: 10.18653/v1/2021.findings-emnlp.233

Multilingual Translation via Grafting Pre-trained Language Models

Abstract: Can pre-trained BERT for one language and GPT for another be glued together to translate texts? Self-supervised training using only monolingual data has led to the success of pre-trained (masked) language models in many NLP tasks. However, directly connecting BERT as an encoder and GPT as a decoder can be challenging in machine translation, for GPT-like models lack a cross-attention component that is needed in seq2seq decoders. In this paper, we propose Graformer to graft separately pre-trained (masked) language…
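
As a rough illustration of the grafting idea described in the abstract, the sketch below wraps a pre-trained GPT-like decoder block with a newly initialized cross-attention sublayer that attends to the encoder output, and freezes everything except the new sublayer. This is a minimal PyTorch sketch under assumed interfaces, names, and dimensions, not the paper's actual Graformer implementation.

```python
import torch
import torch.nn as nn


class GraftedDecoderLayer(nn.Module):
    """One decoder block of a grafted seq2seq model (illustrative only).

    The self-attention/feed-forward sublayer comes from a pre-trained
    GPT-like decoder; the cross-attention sublayer is newly initialized,
    since decoder-only LMs are never trained with one.
    """

    def __init__(self, d_model: int, n_heads: int, pretrained_block: nn.Module):
        super().__init__()
        # Assumed interface: the pre-trained block maps hidden states of
        # shape (batch, seq, d_model) to hidden states of the same shape.
        self.pretrained_block = pretrained_block
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_ln = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        x = self.pretrained_block(x)                 # pre-trained self-attention + FFN
        attn_out, _ = self.cross_attn(self.cross_ln(x), encoder_out, encoder_out)
        return x + attn_out                          # residual around the new sublayer


def graft(encoder: nn.Module, decoder_blocks, d_model: int = 768, n_heads: int = 12):
    """Freeze the pre-trained encoder and decoder blocks; only the newly
    added cross-attention sublayers receive gradients."""
    layers = nn.ModuleList(
        GraftedDecoderLayer(d_model, n_heads, blk) for blk in decoder_blocks
    )
    for p in encoder.parameters():
        p.requires_grad = False
    for layer in layers:
        for p in layer.pretrained_block.parameters():
            p.requires_grad = False
    return encoder, layers
```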

Cited by 12 publications (12 citation statements)
References 42 publications
“…Similar to modern ST architectures (Gállego et al., 2021; …), we use a pretrained W2V2 large model as our encoder and a pretrained mBART50 decoder as our decoder. We randomly initialize the top 3 layers of W2V2 in experiments involving RedApt and find that this enables faster convergence, verifying earlier observations by Sun et al. (2021). We freeze the W2V2 feature extractor.…”
Section: Experimental Settings (supporting)
confidence: 72%
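
The encoder-side recipe in this excerpt (freeze the W2V2 convolutional feature extractor, randomly re-initialize the top 3 transformer layers) could be sketched with Hugging Face Transformers roughly as follows; the checkpoint name and the initialization scheme are assumptions, not the citing authors' exact setup.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

# Load the pretrained speech encoder (checkpoint name is an assumption).
w2v2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-lv60")

# Freeze the convolutional feature extractor so it is not updated during training.
for p in w2v2.feature_extractor.parameters():
    p.requires_grad = False

# Randomly re-initialize the top 3 transformer layers of the encoder
# (the init scheme below is illustrative).
def reinit(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

for layer in w2v2.encoder.layers[-3:]:
    layer.apply(reinit)
```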
“…Sun et al. [19] proposed a grafting approach in which they stack multiple sets of BERT encoder layers and autoregressive GPT decoder layers, and freeze selected encoder or decoder parameters during fine-tuning. This approach leads to strong gains in output quality; however, they do not compare model sizes, training time, or inference time directly with previous work.…”
Section: Related Work (mentioning)
confidence: 99%
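
The "freeze selected encoder or decoder parameters during fine-tuning" step amounts to toggling requires_grad on chosen parameter groups; a generic PyTorch sketch (the parameter-name pattern is an assumption and depends on how the concrete grafted model names its modules):

```python
import torch.nn as nn


def freeze_except(model: nn.Module, trainable_substrings=("cross_attn",)) -> int:
    """Freeze every parameter whose name does not contain one of the given
    substrings; return the number of parameters left trainable.

    With the default pattern, only newly added cross-attention sublayers
    keep receiving gradients while the pre-trained parts stay fixed.
    """
    n_trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)
        if param.requires_grad:
            n_trainable += param.numel()
    return n_trainable
```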
“…Pre-training a text decoder can be done independently (e.g., GPT-2 [26]) or jointly with an encoder for sequence-to-sequence tasks (e.g., mT5 [27], mBART [6]). With the former approach, the text decoder needs to be fused with additional encoders or adapters after pre-training [28], which increases architectural complexity. With the latter approach, the decoder component can usually be used as an individual module without architectural modifications.…”
Section: Pre-trained Models (mentioning)
confidence: 99%
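
To illustrate the latter point, the decoder of a jointly pre-trained seq2seq model can be pulled out as an individual module, cross-attention included; a small Hugging Face Transformers sketch (checkpoint name assumed):

```python
from transformers import MBartForConditionalGeneration

# A jointly pre-trained seq2seq model ships a decoder that already contains
# cross-attention, so it can be detached and reused as a standalone module.
mbart = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
decoder = mbart.get_decoder()  # MBartDecoder with self- and cross-attention

# A decoder-only LM such as GPT-2, by contrast, has no cross-attention,
# so fusing it with a new encoder requires extra adapters or sublayers.
```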