10th ISCA Workshop on Speech Synthesis (SSW 10) 2019
DOI: 10.21437/ssw.2019-43

Using generative modelling to produce varied intonation for speech synthesis

Abstract: Unlike human speakers, typical text-to-speech (TTS) systems are unable to produce multiple distinct renditions of a given sentence. This has previously been addressed by adding explicit external control. In contrast, generative models are able to capture a distribution over multiple renditions and thus produce varied renditions using sampling. Typical neural TTS models learn the average of the data because they minimise mean squared error. In the context of prosody, taking the average produces flatter, more bor…
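
The abstract's point about mean squared error can be illustrated with a toy sketch. It does not reproduce the paper's model: the pitch values, the grid search, and every name below are hypothetical, chosen only to show that the single prediction minimising MSE over two distinct renditions is their average, whereas sampling returns actual renditions.

```python
import random
import statistics

# Hypothetical toy data (an assumption, not from the paper): pitch offsets in
# semitones for the same word across renditions, in a "low" and a "high" style.
renditions = [-4.0, -4.0, -4.0, 4.0, 4.0, 4.0]


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)


# Grid search over constants from -6.0 to 6.0 in steps of 0.1.
candidates = [x / 10 for x in range(-60, 61)]
best = min(candidates, key=lambda c: mse(c, renditions))

# The MSE-optimal constant coincides with the mean of the data, a flat value
# that matches neither rendition style.
mean_value = statistics.mean(renditions)

# Sampling from the empirical distribution instead yields real renditions.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [rng.choice(renditions) for _ in range(5)]

print("MSE-optimal constant:", best)
print("Mean of renditions:", mean_value)
print("Sampled renditions:", samples)
```

Here the empirical distribution stands in for a learned generative model; a real system would sample from a model conditioned on the text rather than from the training targets directly.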

Citations: cited by 20 publications (17 citation statements)
References: 36 publications

“…For example, if the end-to-end system has no overt model of stress, how might we control emphasis? Recent work [Hodari et al 2019, Skerry-Ryan et al 2018] explores this issue and proposes various solutions. However, from a practical perspective, when using an end-to-end speech synthesis system for a SIA it is important to be aware of what expressive speech control is available, if any, as well as the ability to correct errors in the synthesis produced.…”
Section: Designing Expressive Speech - Cross Speaker Features 1351 Language (mentioning)
confidence: 99%
“…Human speech varies due to contextual as well as other more arbitrary factors such as prosody dynamics. There have been studies suggesting to add this idiosyncratic or dynamic variation to a "flat" artificial voice (Hodari et al, 2019). However, it is still unclear how much variability is needed and what is an optimal combination of intonational features (Velner et al, 2020).…”
Section: Qualitative Analyses (mentioning)
confidence: 99%
“…While the naturalness of state-of-the-art TTS is near identical to human speech [1], the prosody may not always be realistic [2]. Overall, prosody in TTS might be described as boring or flat [3], and can become fatiguing for listeners [4].…”
Section: Introduction (mentioning)
confidence: 99%