Speech Prosody 2020
DOI: 10.21437/speechprosody.2020-192

Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

Abstract: Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, the state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible for end-to-end systems with a sing…

Cited by 10 publications (4 citation statements)
References 18 publications
“…Ye et al (2023) show that the prosody pattern reflecting relevant text context contributes to the enhanced TTS. Suni et al (2020); Nguyen et al (2020) suggest that prosodic boundary plays an important role in the naturalness and intelligibility of speech. To incorporate an inherent prosodic structure of input utterance within its context, we build Prosodic Boundary Detector (PBD) trained on a large corpus along with prominence labels.…”
Section: Prosody Phrasing (mentioning)
confidence: 99%
“…Another controllable aspect is the speaker voice, introduced through additional speaker embeddings extracted through a speaker verification network [13]. Finally, [27] proposed a methodology to control the prominence and boundaries by automatically deriving prosodic tags to augment the input character sequence. It is also possible to combine multiple techniques into a single conditioned architecture, as shown by [26].…”
Section: Related Work (mentioning)
confidence: 99%
“…This is the solution we decided to follow for the VC. Additionally, to further refine speech synthesis, we considered augmenting the text to be pronounced with prosodic clues (predicted by the empathetic controller) [20] useful, for example, for putting the focus on specific words.…”
Section: Output Conditioning Modules (mentioning)
confidence: 99%