2020
DOI: 10.48550/arxiv.2010.11445
Preprint

MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation

Junkun Chen,
Mingbo Ma,
Renjie Zheng
et al.

Abstract: End-to-end Speech-to-text Translation (E2E-ST), which directly translates source language speech to target language text, is widely useful in practice, but traditional cascaded approaches (ASR+MT) often suffer from error propagation in the pipeline. On the other hand, existing end-to-end solutions heavily depend on the source language transcriptions for pre-training or multi-task training with Automatic Speech Recognition (ASR). We instead propose a simple technique to learn a robust speech encoder in a self-s…
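To make the technique described in the abstract concrete, below is a minimal PyTorch sketch of masked acoustic modeling in general: random spans of a log-Mel spectrogram are zeroed out and an encoder is trained to reconstruct the masked frames. The layer sizes, the span-masking scheme, and the L1 loss are illustrative assumptions, not the paper's actual configuration.

import torch
import torch.nn as nn

class MAMSketch(nn.Module):
    """Sketch of masked acoustic modeling: corrupt spectrogram spans,
    encode, and reconstruct the masked frames (hyperparameters are
    illustrative assumptions, not the paper's settings)."""

    def __init__(self, n_mels=80, d_model=256, n_layers=4):
        super().__init__()
        self.proj_in = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, n_mels)  # frame reconstruction head

    def forward(self, spec, mask_prob=0.15, span=5):
        # spec: (batch, time, n_mels) log-Mel spectrogram
        B, T, _ = spec.shape
        mask = torch.zeros(B, T, dtype=torch.bool)
        for b in range(B):
            for s in (torch.rand(T) < mask_prob / span).nonzero().flatten().tolist():
                mask[b, s : s + span] = True  # mask a contiguous span of frames
        corrupted = spec.masked_fill(mask.unsqueeze(-1), 0.0)
        recon = self.proj_out(self.encoder(self.proj_in(corrupted)))
        # L1 reconstruction loss, computed only on the masked positions
        return (recon - spec).abs()[mask].mean()

# One self-supervised pre-training step on untranscribed speech:
model = MAMSketch()
loss = model(torch.randn(2, 200, 80))  # toy batch of spectrograms
loss.backward()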

Cited by 2 publications (7 citation statements) | References 19 publications
“…Although the above studies of pre-training with labeled ASR and MT data can accelerate the model convergence and boost the translation quality of ST, parallel ASR data and MT data are still limited, so many works attempt to pre-train an ST model with large-scale unlabeled speech or text data [1268,1269,1270,1257]. Compared to text representation learning, there are some challenges in self-supervised approaches for speech representation learning because speech signals are continuous-valued sequences.…”
Section: Pre-training With Unlabeled Speech/Text Data
confidence: 99%
“…After pre-training, they input the representations produced by the BM to the ST encoder instead of the MFCC and log Mel-filterbank features used in conventional methods. Another line of work explored a more direct approach by learning an ST encoder in a self-supervised fashion only on the speech side [1269,1257]. In [1269], a simple technique is proposed to learn a robust speech encoder in this way, which can utilize speech data without transcription.…”
Section: Pre-training With Unlabeled Speech/Text Data
confidence: 99%
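As a concrete reading of the quoted passage, the sketch below (hypothetical code building on the MAMSketch above; the decoder shape and vocabulary size are assumptions) reuses the pre-trained speech encoder as the front end of a translation model, so the decoder attends to learned representations rather than raw MFCC or log-Mel features.

import torch
import torch.nn as nn

class STWithPretrainedEncoder(nn.Module):
    """Hypothetical fine-tuning wrapper: reuse a pre-trained MAM-style
    encoder and attach a translation decoder on its outputs."""

    def __init__(self, pretrained, vocab_size=8000, d_model=256):
        super().__init__()
        self.pretrained = pretrained  # weights from the self-supervised stage
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, spec, tgt_tokens):
        # Encode speech with the pre-trained encoder (no masking at fine-tuning time).
        h = self.pretrained.encoder(self.pretrained.proj_in(spec))
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        dec = self.decoder(self.embed(tgt_tokens), memory=h, tgt_mask=causal)
        return self.out(dec)  # logits over the target-language vocabulary

# Fine-tuning on speech-translation pairs:
st = STWithPretrainedEncoder(MAMSketch())
logits = st(torch.randn(2, 200, 80), torch.randint(0, 8000, (2, 12)))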
“…Recently, speech representation learning has attracted much attention in the speech community due to its strong performance on many speech-related downstream tasks, such as speech recognition, speech classification, and speech translation (Baevski et al, 2020; Chen et al, 2020; Liu et al, 2020; Zheng et al, 2021; Hsu et al, 2021). However, all these efforts only support speech understanding tasks, which take speech as input; for the inverse direction, speech synthesis, which produces speech as output, the potential of representation learning is yet to be…”
Section: Introduction
confidence: 99%
“…In this way, these models are good at recognizing and extracting discrete information from speech and successfully improve automatic speech recognition (ASR), but they are unable to generate continuous acoustic signals for speech synthesis. On the other hand, another line of work, such as MAM (Chen et al, 2020) and FAT-MLM (Zheng et al, 2021), shows that reconstructing masked spectrograms with continuous units can improve speech-to-text translation. However, the quality of their proposed speech reconstruction is far from the requirement of speech synthesis tasks (see Fig.…”
Section: Introduction
confidence: 99%