2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)
DOI: 10.1109/cisp-bmei51763.2020.9263564

You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

Abstract: Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially when dealing with low-resource tasks. Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. We argue that, when the training data amount is low, this approach can allow an end-to-end model to reach hybrid…
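
The augmentation recipe the abstract outlines (train a TTS model on the ASR corpus, synthesize speech for additional text, and mix it with the real recordings) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `synthesize` is a hypothetical stand-in for a Tacotron-style acoustic model plus vocoder, and `synth_ratio` is an assumed knob for how much synthetic data to add relative to the real corpus.

```python
# Minimal sketch of TTS-based data augmentation for ASR training.
# `synthesize` is a hypothetical placeholder, NOT an API from the paper.
import random


def synthesize(text: str) -> list[float]:
    """Hypothetical TTS front end: returns a waveform for `text`.
    A real system would run a trained acoustic model + vocoder here."""
    return [0.0] * (100 * len(text))  # placeholder waveform


def augment(real_utterances: list[tuple[list[float], str]],
            extra_texts: list[str],
            synth_ratio: float = 1.0) -> list[tuple[list[float], str]]:
    """Extend real (audio, transcript) pairs with synthetic ones.
    `synth_ratio` controls the amount of synthetic data added
    relative to the size of the real corpus."""
    n_synth = int(len(real_utterances) * synth_ratio)
    texts = random.sample(extra_texts, min(n_synth, len(extra_texts)))
    synthetic = [(synthesize(t), t) for t in texts]
    mixed = real_utterances + synthetic
    random.shuffle(mixed)  # interleave real and synthetic examples
    return mixed
```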

Cited by 50 publications (38 citation statements). References 29 publications.
“…As the quality of recent TTS systems has reached a natural level, several attempts have been made to apply TTS-synthesized speech databases to speech applications. For instance, Laptev et al. [12] and Jia et al. [13] improved the performance of automatic speech recognition and speech translation systems by training models with synthetic speech databases generated by Tacotron. In TTS applications, Sharma et al. [14] showed that AR WaveNet-driven data augmentation is effective for improving the quality of the Parallel WaveNet system [15].…”
Section: Relationship To Prior Work
confidence: 99%
“…Employing synthetic audio from TTS for ASR training has recently gained popularity as a result of advancements in TTS. Recent research [11][12][13] has studied creating acoustically and lexically diverse synthetic data, exploring the feasibility of replacing or augmenting real recordings with synthetic data during ASR model training without compromising recognition performance. The results show that synthetic audio can improve training convergence when the amount of available real data is as small as 10 hours, but it cannot yet replace real speech recordings and achieve the same recognition performance given the same text sources [11].…”
Section: Related Work
confidence: 99%
“…A multi-head self-attention (MHA) mechanism significantly improved model quality over recurrent models. A transformer model, trained with CTC-Attention, can outperform neural transducer systems (e.g., [36]) and benefit from various augmentation techniques [37]. Recently, the Conformer [38] was introduced, which is a modification of the transformer layer.…”
Section: ASR Modeling
confidence: 99%
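
For readers unfamiliar with the MHA layer this statement refers to, below is a minimal PyTorch sketch of self-attention over a batch of acoustic frame sequences. The dimensions are illustrative assumptions, not values from the paper or the citing work.

```python
# Minimal self-attention sketch using PyTorch's built-in MHA layer.
# Dimensions (embed_dim, num_heads, sequence length) are illustrative.
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 4
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# A batch of 8 "acoustic frame" sequences, 100 frames each.
x = torch.randn(8, 100, embed_dim)

# Self-attention: query, key, and value are all the same sequence.
out, weights = mha(x, x, x)
print(out.shape)      # torch.Size([8, 100, 256])
print(weights.shape)  # torch.Size([8, 100, 100]), averaged over heads
```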