2012
DOI: 10.1186/1687-4722-2012-21

Speaker-dependent model interpolation for statistical emotional speech synthesis

Abstract: In this article, we propose a speaker-dependent model interpolation method for statistical emotional speech synthesis. The basic idea is to combine the neutral model set of the target speaker and an emotional model set selected from a pool of speakers. For model selection and interpolation weight determination, we propose to use a novel monophone-based Mahalanobis distance, which is a proper distance measure between two Hidden Markov Model sets. We design Latin-square evaluation to reduce the systematic bias i…
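The abstract names two operations: selecting an emotional model set via a monophone-based Mahalanobis distance between HMM sets, and interpolating the selected set with the target speaker's neutral models. The sketch below is a minimal illustration of both ideas, not the paper's implementation: it assumes single-Gaussian monophone states, and the per-phone averaging, covariance pooling, function names, and linear interpolation rule are assumptions made here for illustration only.

```python
# Minimal sketch (assumed formulation, not the paper's exact method):
# a Mahalanobis-style distance between two monophone model sets and a
# linear interpolation of their Gaussian parameters.
import numpy as np


def mahalanobis_distance(mean_a, mean_b, cov):
    """Mahalanobis distance between two mean vectors under a shared covariance."""
    diff = mean_a - mean_b
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))


def model_set_distance(set_a, set_b):
    """Average per-monophone Mahalanobis distance between two model sets.

    set_a, set_b: dicts mapping monophone name -> (mean vector, covariance).
    Averaging over shared monophones and pooling the two covariances are
    assumptions for this sketch.
    """
    dists = []
    for phone in set_a.keys() & set_b.keys():
        mean_a, cov_a = set_a[phone]
        mean_b, cov_b = set_b[phone]
        pooled_cov = 0.5 * (cov_a + cov_b)  # assumed covariance pooling
        dists.append(mahalanobis_distance(mean_a, mean_b, pooled_cov))
    return float(np.mean(dists))


def interpolate_models(neutral, emotional, w):
    """Linearly interpolate per-monophone Gaussian parameters.

    w = 0 keeps the target speaker's neutral model; w = 1 uses the selected
    emotional model. The linear rule is a common choice, assumed here.
    """
    out = {}
    for phone in neutral.keys() & emotional.keys():
        mean_n, cov_n = neutral[phone]
        mean_e, cov_e = emotional[phone]
        out[phone] = ((1 - w) * mean_n + w * mean_e,
                      (1 - w) * cov_n + w * cov_e)
    return out


# Toy usage with 2-dimensional "features" for two monophones.
rng = np.random.default_rng(0)
neutral = {p: (rng.normal(size=2), np.eye(2)) for p in ("a", "i")}
emotional = {p: (rng.normal(loc=1.0, size=2), np.eye(2)) for p in ("a", "i")}
print("distance:", model_set_distance(neutral, emotional))
print("interpolated 'a' mean:", interpolate_models(neutral, emotional, 0.5)["a"][0])
```

In the paper, the interpolation weight is tied to the measured distance between model sets; in this sketch the weight w is simply passed in by hand.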

Cited by 4 publications (4 citation statements)
References 18 publications
“…The underlying methodology of previous studies is a lexicon-based method by which a lexicon is used to detect emotions in text based on speech recognition (Pajupuu et al 2012; Kato et al 2006). Using a single modal of speech recognition, the correct emotion recognition reaches 73% for happiness, 60% for angry, 55% for sadness, and the overall accuracy is 62% for all emotions (Hsu and Chen 2012). However, bimodal emotion recognition can reach an accuracy of 86.85%, an increase of 5% compared with using a single modal of emotion recognition (Song et al 2015; Chuang and Wu 2004; Kessous et al 2010).…”
Section: Methods
Citation type: mentioning (confidence: 99%)
“…Therefore, the digital human designers in future studies can use the digital human users in future studies to determine the effects of individually customised designs on user behaviours, functions, or structures. As a result, the digital human designers in future studies can also use the digital human users in future studies to fully automate the design process (Bekey, 1998; Bicego, 2005; Carruth et al, 2007; Choi et al, 2008; De Magistris et al, 2013; Duffy, 2007a, 2007b; Hardy et al, 1984; Hsu and Chen, 2012; Ishihara et al, 2005; Jung et al, 2009; Kao and Smith, 2011; Kuffner et al, 2003; Li and Zhang, 2007; Liu et al, 2010; Lu and Smith, 2006; Lu et al, 2010; Maldonado-Bascon, 2007; May et al, 2011; Miller et al, 2010; Minami et al, 1994; Noble et al, 2012; Oguri et al, 2000; Rohrer, 2007; Smith and Smith, submitted; Smith and Yen, 2010; Smith et al, 2012a, 2012b; Taish et al, 2011; Tokuda et al, 1998; van den Broek, 2010; Zhang and Tan, 2013; Zimmer and Miteran, 2001).…”
Section: Future Studies
Citation type: mentioning (confidence: 99%)
“…On the other hand, the level of expressive strength or speaker similarity cannot be guaranteed as the transplantation reach is very constrained. This is also the case for model interpolation techniques (Hsu et al, 2012), capable of achieving better expressiveness than traditional adaptation techniques at a cost in speaker similarity.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)