Interspeech 2022
DOI: 10.21437/interspeech.2022-10802

Strategies for developing a Conversational Speech Dataset for Text-To-Speech Synthesis

Abstract: There have been many efforts to improve the quality of speech synthesis systems in conversational AI. Although state-of-the-art systems are capable of producing natural-sounding speech, the generated speech often lacks prosodic variation and is not always suited to the task. In this paper, we examine dialogue data collection methods to use as training data for our acoustic models. We collect speech using three different setups: (1) Random read-aloud sentences; (2) Performed dialogues; (3) Semi-Spontaneous dialo…

Cited by 1 publication (1 citation statement); references 35 publications.

“…Moreover, the prosody is generated solely from text, which allows no control over the generated style [5]. Concurrently, spontaneous speech is increasingly used in TTS [6,7]. Spontaneous speech data is challenging to model, due to disfluencies and large variability [8]; offers high ecological validity for evermore commonplace conversational AI systems, and the varied prosody offered by spontaneous data positively impacts factors such as word recall and attention [9].…”
Section: Introduction; citation type: mentioning; confidence: 99%