ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414080

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Text Data

Abstract: This paper presents a method to pre-train transformer-based encoder-decoder automatic speech recognition (ASR) models using sufficient target-domain text. During pre-training, we train the transformer decoder as a conditional language model with empty or artificial states, rather than the real encoder states. With this pre-training strategy, the decoder can learn how to generate grammatical text sequences before learning how to generate correct transcriptions. In contrast to other methods which utilize text-only data…
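
The pre-training strategy described in the abstract can be sketched as follows. This is a minimal PyTorch sketch, not the authors' code: the vocabulary size, model dimensions, and batch shapes are illustrative assumptions, and the "empty" states are realized here as all-zero memory vectors, the simplest of the variants the abstract mentions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, d_model = 1000, 256  # illustrative values, not from the paper

    embed = nn.Embedding(vocab_size, d_model)
    decoder = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=6,
    )
    proj = nn.Linear(d_model, vocab_size)

    def pretrain_step(text_ids):
        # Next-token prediction on unpaired text: shift inputs/targets by one.
        inp, tgt = text_ids[:, :-1], text_ids[:, 1:]
        # "Empty" encoder states: one all-zero memory vector per sequence
        # stands in for the real acoustic encoder outputs (the paper also
        # considers artificial states; zeros are the simplest choice).
        memory = torch.zeros(inp.size(0), 1, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(inp.size(1))
        hidden = decoder(embed(inp), memory, tgt_mask=causal)
        logits = proj(hidden)
        return F.cross_entropy(logits.reshape(-1, vocab_size), tgt.reshape(-1))

    loss = pretrain_step(torch.randint(0, vocab_size, (8, 32)))  # dummy batch
    loss.backward()

During fine-tuning on paired data, the zeroed memory would be replaced by the real encoder states, so the decoder arrives at that stage already able to produce grammatical text.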

Cited by 14 publications (8 citation statements) · References 15 publications

Citation statements (ordered by relevance):
“…In addition, an idea was explored to pre-train a decoder for end-to-end ASR [4,14,15]. The authors in [4] employ a single-speaker text-to-speech (TTS) system to generate synthesized speech from a large number of transcripts, and use the generated speech-text pairs to pre-train the decoder.…”
Section: Related Work
confidence: 99%
“…The authors in [4] employ a single-speaker text-to-speech (TTS) system to generate synthesized speech from a large number of transcripts, and use the generated speech-text pairs to pre-train the decoder. In [14], unpaired text data are used to pre-train the transformer decoder, which is pre-trained as a conditional language model by constructing empty or artificial states to replace the real encoder hidden states. Leveraging large-scale unpaired speech and text data, SpeechT5 [15] pre-trains a shared encoder-decoder model for various spoken language tasks.…”
Section: Related Work
confidence: 99%
“…Recently, how to use unpaired text data, which is much easier to collect than paired speech-text data, to improve S2S architectures has attracted more and more attention [33]. These works try to solve the problem that the decoder depends on the encoder output and thus cannot be separately pre-trained [13,34]. For example, Gao et al. [13] pre-train the decoder on unpaired text data with empty or artificial states instead of real encoder states, which not only fails to use already-pretrained LMs like GPT2, but also yields limited gains due to the input state mismatch between pre-training and fine-tuning.…”
Section: One-Cross Decoder (OCD)
confidence: 99%
“…Furthermore, the decoder of S2S ASR models cannot be pre-trained separately because of its dependency on acoustic representations. It is also hard to utilize pre-trained LMs like BERT [8] or GPT2 [12] for parameter initialization due to architecture mismatch [13]. Therefore, how to efficiently utilize pre-trained acoustic models (AMs) and language models (LMs) in dominant S2S ASR models remains an open challenge.…”
Section: Introduction
confidence: 99%
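
The architecture mismatch mentioned in the statement above can be illustrated with a short sketch. This is an assumed, simplified module layout (not the cited papers' code): a GPT2-style decoder-only LM block contains only self-attention and a feed-forward network, whereas an encoder-decoder ASR decoder layer additionally carries a cross-attention submodule over acoustic encoder states, so a pretrained decoder-only LM cannot initialize it parameter-for-parameter.

    import torch.nn as nn

    class DecoderOnlyBlock(nn.Module):
        """GPT2-style block: self-attention + feed-forward, no cross-attention."""
        def __init__(self, d=256, heads=4):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)
            )

    asr_layer = nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True)
    lm_block = DecoderOnlyBlock()

    # The ASR decoder layer has cross-attention ("multihead_attn") weights
    # with no counterpart in the decoder-only LM block.
    print([k for k in asr_layer.state_dict() if k.startswith("multihead_attn")])
    print(any("multihead_attn" in k for k in lm_block.state_dict()))  # False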