2020
DOI: 10.48550/arxiv.2002.00417
Preprint

WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss

Abstract: Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input. Such frameworks typically consist of a feature prediction network that maps character sequences to frequency-domain acoustic features, followed by a waveform reconstruction algorithm or a neural vocoder that generates the time-domain waveform from the acoustic features. Because the loss function is usually calculated only on the frequency-domain acoustic features, it does not directly control the quality of the generated time-domain w…
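
The gap the abstract points to can be illustrated with a small NumPy sketch. It is not the paper's loss: the truncated abstract does not give the formulation, so the function names, the mean-squared spectrogram loss, the L1 waveform loss, and the `time_weight` parameter are all assumptions made for illustration. The sketch also compares waveforms directly, whereas a real Tacotron-style system would predict acoustic features and rely on a vocoder to obtain the waveform.

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=256, hop=128):
    """Magnitude STFT built from Hann-windowed frames (illustrative settings)."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def frequency_loss(pred_wave, target_wave):
    """Mean squared error between magnitude spectrograms (assumed form)."""
    diff = magnitude_spectrogram(pred_wave) - magnitude_spectrogram(target_wave)
    return float(np.mean(diff ** 2))

def time_loss(pred_wave, target_wave):
    """Mean absolute error between waveforms (assumed form)."""
    return float(np.mean(np.abs(pred_wave - target_wave)))

def joint_loss(pred_wave, target_wave, time_weight=1.0):
    """Hypothetical joint objective: frequency term plus weighted time term."""
    return frequency_loss(pred_wave, target_wave) + time_weight * time_loss(
        pred_wave, target_wave
    )

# A 440 Hz tone at 16 kHz and its sign-flipped copy share a magnitude
# spectrogram, yet the two waveforms differ at almost every sample.
target = np.sin(2 * np.pi * 440 * np.arange(2048) / 16000)
flipped = -target

print("frequency-only loss:", frequency_loss(flipped, target))
print("time-domain loss:   ", time_loss(flipped, target))
print("joint loss:         ", joint_loss(flipped, target))
```

In this toy case the frequency-only term cannot tell the two signals apart, while the time-domain term can, which is the kind of mismatch a joint objective is meant to expose.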

Citations: cited by 2 publications (2 citation statements)
References: 39 publications
“…As illustrated in Figure 8, encoder-decoder models with attention have recently shown considerable success in modeling a variety of complex sequence-to-sequence problems. Tacotron [87], [176], [208] represents one of the successful text-to-speech (TTS) implementations, that has been extended to voice conversion [3], [179].…”
Section: Non-parallel Data of Paired Speakers (mentioning)
confidence: 99%
“…It plays an important role as a manifestation at semantic and pragmatic level of spoken languages. An adequate rendering of emotion in speech is critically important in expressive text-to-speech [2,3], personalized speech synthesis, and intelligent dialogue systems, such as social robots and conversational agents.…”
Section: Introduction (mentioning)
confidence: 99%