ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682168
Effect of Data Reduction on Sequence-to-sequence Neural TTS

Abstract: Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech almost indistinguishable from human recordings. However, these models require large amounts of data. This paper shows that the lack of data from one speaker can be compensated with data from other speakers. The naturalness of Tacotron2-like models trained on a blend of 5k utterances from 7 speakers is better than that of speaker-dependent models trained on 15k utterances, but in terms of stability m…

Cited by 42 publications (32 citation statements)
References 16 publications
“…Each Tacotron was trained for 350k training steps. As [24] noted, neural S2S TTS models occasionally fail to stabilise, as differing random seeds influence the ability to learn effective alignments between text and speech. Some Tacotrons were unstable after training, therefore we trained each system with 3 different seeds and, after informal listening, selected the best performing for the MUSHRA.…”
Section: TTS Systems
confidence: 99%
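The "train with several seeds, keep the best" practice described in the quote above can be sketched in a few lines. This is a hypothetical illustration, not code from the cited paper: `train_and_score` is a stand-in for training a Tacotron with a given seed and scoring its alignment stability (here replaced by a toy deterministic score), and the real selection in the paper was done by informal listening rather than an automatic metric.

```python
import random

def train_and_score(seed):
    # Stand-in for "train a Tacotron with this random seed and score
    # the resulting alignments"; a toy deterministic score is used here.
    rng = random.Random(seed)
    return rng.random()

def select_best_seed(seeds):
    # Train one system per seed and keep the highest-scoring one,
    # mirroring the multi-seed selection procedure in the quote.
    return max(seeds, key=train_and_score)
```

Because failed alignments are visible in the output, even a crude stability score (or a human listener, as in the paper) suffices to discard unstable runs.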
“…Voice banking is a simple idea of collecting a patient's speech samples before their speech becomes unintelligible and using them to build a personalized Text-To-Speech (TTS) voice. It requires about 1800 utterances for a basic unit-selection TTS technology [14] and more than 5K utterances for building a Neural TTS voice [15]. Voice adaptation requires as little as 7 minutes of recordings.…”
Section: Speech Reconstruction
confidence: 99%
“…It uses a per-language encoder to process input sequences in terms of phonemes. [10] uses a phoneme-based representation with vowels using three different symbols depending on their level of stress.…”
Section: Introduction
confidence: 99%
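The stress-marked vowel representation mentioned in the last quote follows the ARPAbet convention used by resources such as CMUdict, where each vowel phoneme carries a digit for its stress level. A minimal sketch of generating the three stressed variants of a vowel symbol (the function name is illustrative, not from the cited work):

```python
def vowel_with_stress(vowel):
    # ARPAbet/CMUdict convention: a vowel symbol carries a stress
    # digit -- 0 (unstressed), 1 (primary), 2 (secondary) -- so each
    # vowel expands into three distinct input symbols for the model.
    return [f"{vowel}{level}" for level in (0, 1, 2)]
```

For example, the vowel "AA" yields the symbols AA0, AA1, and AA2, giving the TTS front end three separate tokens per vowel depending on lexical stress.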