2015
DOI: 10.1109/taslp.2014.2385478

Speaker and Expression Factorization for Audiobook Data: Expressiveness and Transplantation

Abstract: Expressive synthesis from text is a challenging problem. There are two issues. First, read text is often highly expressive to convey the emotion and scenario in the text. Second, since the expressive training speech is not always available for different speakers, it is necessary to develop methods to share the expressive information over speakers. This paper investigates the approach of using very expressive, highly diverse audiobook data from multiple speakers to build an expressive speech synthesis system. B…

Citations: cited by 9 publications (7 citation statements)
References: 27 publications
“…In the case of the unit selection synthesis, there are 6 different versions, 3 female (neutral, lax and tense) and 3 male (neutral, lax and tense). considering how to effectively model voice quality within parametric speech synthesis, e.g., [48], [49], [50], [51], [52], however this is atypical and in the work presented here the parametric system that is evaluated is very close to the design described in Zen et al [22]. The approach does not directly model voice quality, but includes mixed excitation where noise representing frication is added to a pulse train to generate speech.…”
Section: Synthetic Speech Materials (mentioning)
confidence: 99%
“…This solution requires as many speech databases as required speaking styles and raises the issue of the consistency between semantics and expressivity. A speaker and expressivity factorization could help to solve this problem [21]. Otherwise, expressivity can also be controlled in symbolic terms (diphone identity, position, etc.)…”
Section: Generation Of Expressive Speech (mentioning)
confidence: 99%
“…Because of the success of end-to-end text-to-speech (E2E-TTS) models, researchers have been trying to expand this framework to synthesize more expressive speech [9]- [13]. Unlike emotionally neutral speech (narrative speech) which has monotonic prosody, expressive speech has many variations in prosody.…”
Section: Introduction (mentioning)
confidence: 99%