Interspeech 2022
DOI: 10.21437/interspeech.2022-592
M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation

Abstract: End-to-end speech-to-text translation models are often initialized with a pre-trained speech encoder and a pre-trained text decoder. This leads to a significant training gap between pre-training and fine-tuning, largely due to the modality differences between speech outputs from the encoder and text inputs to the decoder. In this work, we aim to bridge the modality gap between speech and text to improve translation quality. We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to t…

Cited by 5 publications (2 citation statements)
References 28 publications
“…It processes the audio features obtained by applying 80-dimensional Mel filterbanks to the audio waveform. The W2V-BERT encoder is followed by a Length Adapter based on a modified version of the M-Adapter (Zhao et al., 2022), which is a Transformer-based model (Vaswani et al., 2017) that compresses the speech representation (by a factor of 8) through attention pooling. The compressed input representations are then fed to the NLLB decoder, in its 1.3B-parameter configuration, to produce the translations.…”
Section: SimulSeamless
confidence: 99%
“…Training batch size for a modern ST system (Gállego et al., 2021) could not exceed 1 on a V100 16GB GPU. […] representation length, and Zhao et al. (2022) proposed a Transformer-based adaptor to shrink a sequence. Yet, the complexity of encoding remains high.…”
Section: Introduction
confidence: 99%