ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054681
Teacher-Student Training For Robust Tacotron-Based TTS

Abstract: While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways, the exposure bias problem in autoregressive models remains an issue to be resolved. Exposure bias arises from the mismatch between the training and inference processes, which results in unpredictable performance on out-of-domain test data at run-time. To overcome this, we propose a teacher-student training scheme for Tacotron-based TTS by introducing a distillation loss function in addition …
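The abstract describes a teacher-student scheme in which a distillation loss is added to the usual reconstruction objective. The sketch below illustrates one plausible reading of that idea under stated assumptions: a teacher pass runs with teacher forcing (conditioned on ground-truth frames), a student pass runs free-running (conditioned on its own predictions, as at inference), and the student is pulled toward the teacher. The `model.encoder` / `model.decoder_step` interfaces and the `alpha` weight are hypothetical placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def teacher_student_loss(model, text, mel_target, alpha=1.0):
    """Illustrative reconstruction + distillation objective (not the paper's code)."""
    memory = model.encoder(text)  # encoded character sequence (assumed interface)

    # Teacher pass: condition each decoder step on the ground-truth previous frame.
    teacher_frames = []
    prev = torch.zeros_like(mel_target[:, :1])
    for t in range(mel_target.size(1)):
        frame = model.decoder_step(memory, prev)
        teacher_frames.append(frame)
        prev = mel_target[:, t:t + 1]          # teacher forcing
    teacher_out = torch.cat(teacher_frames, dim=1)

    # Student pass: condition each step on the model's own previous prediction,
    # mirroring the free-running inference condition that causes exposure bias.
    student_frames = []
    prev = torch.zeros_like(mel_target[:, :1])
    for t in range(mel_target.size(1)):
        frame = model.decoder_step(memory, prev)
        student_frames.append(frame)
        prev = frame.detach()                  # feed back own output
    student_out = torch.cat(student_frames, dim=1)

    recon_loss = F.l1_loss(teacher_out, mel_target)               # standard Tacotron-style loss
    distill_loss = F.l1_loss(student_out, teacher_out.detach())   # student matches teacher
    return recon_loss + alpha * distill_loss
```

In this reading, the distillation term exposes the decoder to its own prediction errors during training, which is the mismatch the abstract attributes to exposure bias.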

Cited by 44 publications (30 citation statements) · References 31 publications
“…Encoder-decoder models with attention have recently shown considerable success in modeling a variety of complex sequence-to-sequence problems. Tacotron [89], [180], [218]- [220] represents one of the successful text-to-speech (TTS) implementations, that has been extended to voice conversion [3], [183], [221]. The strategy to leverage TTS knowledge is built on the ideas of shared attention knowledge and/or shared decoder architecture as illustrated in Figure 8.…”
Section: Non-parallel Data of Paired Speakers (mentioning, confidence: 99%)
“…Just like most of other TTS systems, Tacotron [3] is trained to predict the Mel spectrum features from input sequence of characters. Prosody, if taken into consideration, is modeled from the statistics of the training data [3], [4], [6], [8]. We note that the character sequences themselves are not the most suitable for describing prosody.…”
Section: Tacotron-based TTS (mentioning, confidence: 99%)
“…A TTS system is expected to synthesize the right prosodic pattern at the right time. However, most of the current end-to-end systems [3], [4], [6], [7] have not explicitly modeled speech prosody. Therefore, they can't control well the melodic and rhythmic aspects of the generated speech.…”
Section: Introduction (mentioning, confidence: 99%)
“…Any errors made in the phrase breaking are propagated to other downstream prosodic models, resulting in unnatural speech [10,11]. Nonetheless, some newly developed speech synthesis systems, such as Tacotron [12][13][14][15][16][17][18][19][20], WaveNetbased approaches [21][22][23][24][25][26][27], and Deep Voice [28] have not specifically modeled prosodic cues from input text. Therefore, they cannot explicitly control prosodic phrasing [29].…”
Section: Introduction (mentioning, confidence: 99%)