2005
DOI: 10.1093/ietisy/e88-d.3.510

Modeling Improved Prosody Generation from High-Level Linguistically Annotated Corpora

Abstract: Synthetic speech usually suffers from a poor surface F0 contour. Robust prediction of the underlying pitch targets depends on the quality of the predicted prosodic structures, i.e. the corresponding sequences of tones and breaks. In the present work, we used a linguistically enriched annotated corpus to build data-driven models that predict prosodic structures with increased accuracy. We then used a linear regression approach for the F0 modeling. An appropriate XML annotation scheme h…
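The abstract names linear regression as the F0 modeling step. As a rough illustration only, the sketch below fits F0 target values (in Hz) to per-syllable numeric features by ordinary least squares, solving the normal equations in pure Python. The feature names (`is_accented`, `is_phrase_final`), the helper names (`fit_f0_model`, `predict_f0`) and the toy data are hypothetical; the paper's actual feature set and model details are not given here.

```python
"""Minimal sketch: linear-regression prediction of F0 targets (Hz).

Assumption (hypothetical, not from the paper): each syllable is described by a
small vector of numeric linguistic features, e.g. [is_accented, is_phrase_final].
"""
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_f0_model(features: Sequence[Sequence[float]], f0: Sequence[float]) -> List[float]:
    """Return weights (intercept first) minimizing the squared F0 prediction error."""
    rows = [[1.0] + [float(v) for v in x] for x in features]  # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, f0)) for i in range(k)]
    return solve(xtx, xty)


def predict_f0(weights: Sequence[float], x: Sequence[float]) -> float:
    """Predict the F0 target (Hz) for one syllable's feature vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


if __name__ == "__main__":
    # Toy, made-up data: [is_accented, is_phrase_final] -> F0 target in Hz.
    X = [[0, 0], [1, 0], [0, 1], [1, 1]]
    y = [200, 240, 170, 210]
    w = fit_f0_model(X, y)
    print("weights:", w)
    print("accented, non-final:", predict_f0(w, [1, 0]))
```

On this toy data the fit recovers an intercept near 200 Hz, a +40 Hz accent effect and a -30 Hz phrase-final effect. A real system would use many more features and far more data, and would typically rely on a numerical library rather than hand-written elimination.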

Cited by 13 publications (8 citation statements)
References 18 publications
“…Further improvements in the tree-structured predictors can be achieved by introducing more delicate linguistic features, as has been investigated in [16] for the CART approach.…”
Section: Discussion (mentioning)
confidence: 99%
“…The audio recordings were digitized through the audio card. DEMOSTHeNES is a modular and scalable (Xydas & Kouroupetroglou, 2001a), multilingual and polyglot (Xydas & Kouroupetroglou, 2001b) TTS system that supports Greek and English with various voices and incorporates advanced speech synthesis methodologies in order to produce almost natural pitch and prosody (Xydas, Spiliotopoulos, & Kouroupetroglou, 2005; Xydas, Zervas, Kouroupetroglou, Fakotakis, & Kokkinakis, 2005). The DEMOSTHeNES TTS system used in this study incorporates a diphone-based MBROLA (Multi-Band Resynthesis OverLap and Process) synthesis technique (Dutoit, Pagel, Pierret, Bataille, & van Der Vreken, 1996), with the female voice and the following parameters: pitch = 210 Hz; speed = 95 wpm.…”
Section: Methods (mentioning)
confidence: 99%
“…The labelling of the intonational phenomena had been conducted mainly by listening to the recorded utterance in conjunction with observation of the amplitude and pitch contour of the speech signal. The annotator's transcription consistency was further evaluated by statistically cross-checking our data against a prosodic corpus constructed at the University of Athens for speech synthesis purposes [32].…”
Section: The GRToBI Prosody Annotation System (mentioning)
confidence: 99%