Interspeech 2015
DOI: 10.21437/interspeech.2015-116

Emotional transplant in statistical speech synthesis based on emotion additive model

Cited by 9 publications (8 citation statements)
References 8 publications

“…The outputs of the emotion-dependent part and the speaker-dependent part are summed linearly, because the linear activation function is used at the output layer. The PM is newly proposed and is motivated by a multi-speaker DNN [6] and the emotion additive model [17], where hidden layers are regarded as a linguistic feature transformation shared by all speakers [23]. Because the acoustic feature is represented as the addition of the emotion-dependent part and the speaker-dependent part, the emotional factor and the speaker factor are separately controlled.…”
Section: Parallel Model
Citation type: mentioning (confidence: 99%)
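
To make the additive structure described in the snippet above concrete, here is a minimal sketch, assuming a PyTorch-style implementation; the module names, layer sizes, and feature dimensions are illustrative assumptions and are not taken from the cited paper.

# Illustrative sketch (assumptions, not the cited paper's implementation):
# the acoustic-feature output is the linear sum of an emotion-dependent
# branch and a speaker-dependent branch, each ending in a linear output layer.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One branch: hidden layers followed by a linear output layer."""
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),  # linear activation at the output
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ParallelModel(nn.Module):
    """Acoustic feature = emotion-dependent part + speaker-dependent part."""
    def __init__(self, ling_dim: int = 256, hidden_dim: int = 512, acoustic_dim: int = 80):
        super().__init__()
        self.emotion_branch = Branch(ling_dim, hidden_dim, acoustic_dim)
        self.speaker_branch = Branch(ling_dim, hidden_dim, acoustic_dim)

    def forward(self, linguistic_features: torch.Tensor) -> torch.Tensor:
        # Linear summation keeps the emotion factor and the speaker factor separable.
        return self.emotion_branch(linguistic_features) + self.speaker_branch(linguistic_features)

# Usage: a batch of 4 frames with 256-dim linguistic features -> 80-dim acoustic features.
model = ParallelModel()
y = model(torch.randn(4, 256))
print(y.shape)  # torch.Size([4, 80])

Because the two branches only meet at the final summation, the emotion-dependent parameters can in principle be swapped or adapted without touching the speaker-dependent ones, which is the separability the snippet describes.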
“…The proposal included using a constrained structural maximum a posteriori linear regression (CSMAPLR) algorithm [16]. Ohtani et al. proposed an emotion additive model to extrapolate emotional expression for a neutral voice [17]. All the aforementioned methods suggest that the extrapolation of emotional expressions is possible by separately modeling the emotional expressions and the speaker identities.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
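
One schematic way to write the additive idea behind the emotion additive model mentioned in this snippet, assuming the transplant operates on acoustic-model mean parameters; the notation and the exact parameterization are assumptions here, not a reproduction of [17]:

% Schematic emotion transplantation by addition (notation assumed, not from [17]):
% \mu denotes acoustic-model mean parameters for a given context.
\mu^{\text{target}}_{\text{emotional}}
  \approx \mu^{\text{target}}_{\text{neutral}}
  + \underbrace{\bigl(\mu^{\text{source}}_{\text{emotional}} - \mu^{\text{source}}_{\text{neutral}}\bigr)}_{\text{emotion additive component}}

Under this reading, the emotion additive component is estimated from a source speaker who has both neutral and emotional recordings and is then reused to extrapolate the emotion onto a target speaker with neutral data only, matching the extrapolation framing in the snippet.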
“…Proper expression rendering affects overall speech perception, which is important for applications such as audiobooks and newsreaders. In particular, emotional speech synthesis, which focuses on emotion expression rendering, has drawn much attention recently [11][12][13][14][15]. The emotional expressions are directly affected by the speaker's intentions, leading to speech with different emotion categories such as happy, angry, sad and fear.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…As a part of the important information conveyed by human speech, emotional expressions are directly affected by the speaker's intentions, which may lead to different emotions, e.g., fear, angry, happy, sad, surprise and disgust. Therefore, how to present appropriate emotions in synthetic speech is important in building diverse audio generation systems and immersive human-computer interaction systems [12], [13], [14], [15], [16], and thus has drawn much attention recently [17], [18], [19], [20], [21], [22].…”
Citation type: mentioning (confidence: 99%)
“…In the same-speaker scenario, to synthesize emotional speech of a single speaker, a straightforward way is to train a TTS model with categorized emotional data [23] if sizable emotional data is available. There are also several other methods to achieve this goal, e.g., model adaptation on a base model using a small amount of emotional data [24], [25] and code/embedding-based methods [19], [26], [27]. However, the weakness of these same-speaker methods is obvious.…”
Citation type: mentioning (confidence: 99%)
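
As a rough illustration of the code/embedding-based conditioning mentioned in this snippet, here is a minimal sketch in the same PyTorch style as above; the network layout, dimensions, and emotion-id mapping are hypothetical and do not correspond to any specific cited method.

# Illustrative sketch (assumptions): conditioning a TTS acoustic model
# on a learned emotion embedding looked up by a categorical emotion id.
import torch
import torch.nn as nn

class EmotionConditionedAcousticModel(nn.Module):
    def __init__(self, num_emotions: int = 6, text_dim: int = 256,
                 emo_dim: int = 64, acoustic_dim: int = 80):
        super().__init__()
        self.emotion_table = nn.Embedding(num_emotions, emo_dim)  # one vector per emotion category
        self.decoder = nn.GRU(text_dim + emo_dim, 256, batch_first=True)
        self.out = nn.Linear(256, acoustic_dim)

    def forward(self, text_states: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        # Broadcast the emotion embedding over all time steps and concatenate it
        # with the text encoder states before decoding acoustic features.
        emo = self.emotion_table(emotion_id).unsqueeze(1).expand(-1, text_states.size(1), -1)
        h, _ = self.decoder(torch.cat([text_states, emo], dim=-1))
        return self.out(h)

# Usage: 2 utterances, 50 encoder frames each; emotion ids 0 and 3 are arbitrary labels here.
model = EmotionConditionedAcousticModel()
mel = model(torch.randn(2, 50, 256), torch.tensor([0, 3]))
print(mel.shape)  # torch.Size([2, 50, 80])

The design intent of such embedding-based methods is that switching the emotion id changes the rendered expression at synthesis time without retraining the whole model, although, as the snippet notes, they still rely on the same speaker's emotional data.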