Emotional transplant in statistical speech synthesis based on emotion additive model

Ohtani, Yamato; Nasu, Yu; Morita, Masahiro

doi:10.21437/interspeech.2015-116

Cited by 9 publications

(8 citation statements)

References 8 publications


“…The outputs of the emotion-dependent part and speaker-dependent part are summed linearly, because the linear activation function is used at the output layer. The PM is newly proposed and is motivated by a multi-speaker DNN [6] and the emotion additive model [17], where hidden layers are regarded as a linguistic feature transformation shared by all speakers [23]. Because the acoustic feature is represented as the addition of the emotion-dependent part and the speaker-dependent part, the emotional factor and speaker factor are separately controlled.…”

Section: Parallel Model (mentioning)

confidence: 99%
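As a rough illustration of the additive idea in the statement above, the sketch below sums a speaker-dependent output and an emotion-dependent output element-wise on top of a shared hidden representation. Every name here (`linear`, `parallel_output`, the toy weights and values) is a hypothetical stand-in and not taken from the cited papers, whose models are DNNs trained on acoustic features.

```python
# Minimal sketch of an additive "parallel" output layer.
# Illustrative assumption only, not the cited authors' implementation.

def linear(weights, bias, x):
    """Apply a linear output layer: y = W x + b (no nonlinearity)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def parallel_output(hidden, speaker_layer, emotion_layer):
    """Sum speaker-dependent and emotion-dependent outputs element-wise."""
    spk = linear(*speaker_layer, hidden)
    emo = linear(*emotion_layer, hidden)
    return [s + e for s, e in zip(spk, emo)]


# Toy shared hidden representation (stands in for the shared hidden layers).
hidden = [1.0, 2.0]

# Hypothetical per-speaker and per-emotion output layers: (weights, bias).
speaker_a = ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
neutral = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
joyful = ([[0.5, 0.0], [0.0, -0.5]], [0.0, 1.0])

neutral_features = parallel_output(hidden, speaker_a, neutral)
joyful_features = parallel_output(hidden, speaker_a, joyful)
```

Because the emotion term is a separate addend, swapping `neutral` for `joyful` changes the output without touching the speaker term, which is the sense in which the two factors can be controlled separately.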

“…The proposal included using a constrained structural maximum a posteriori linear regression (CSMAPLR) algorithm [16]. Ohtani et al. proposed an emotion additive model to extrapolate emotional expression for a neutral voice [17]. All the aforementioned methods above suggest that the extrapolation of emotional expressions is possible by separately modeling the emotional expressions and the speaker identities.…”

Section: Introduction (mentioning)

confidence: 99%


Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Inoue, Hara, Abe et al. 2021. Preprint.

This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). In this study, the meaning of "extrapolate emotional expressions" is to borrow emotional expressions from others, and the collection of emotional speech uttered by target speakers is unnecessary. Although a DNN has potential power to construct DNN-based TTS with emotional expressions and some DNN-based TTS systems have demonstrated satisfactory performances in the expression of the diversity of human speech, it is necessary and troublesome to collect emotional speech uttered by target speakers. To solve this issue, we propose architectures to separately train the speaker feature and the emotional feature and to synthesize speech with any combined quality of speakers and emotions. The architectures are parallel model (PM), serial model (SM), auxiliary input model (AIM), and hybrid models (PM&AIM and SM&AIM). These models are trained through emotional speech uttered by few speakers and neutral speech uttered by many speakers. Objective evaluations demonstrate that the performances in the open-emotion test provide insufficient information. They make a comparison with those in the closed-emotion test, but each speaker has their own manner of expressing emotion. However, subjective evaluation results indicate that the proposed models could convey emotional information to some extent. Notably, the PM can correctly convey sad and joyful emotions at a rate of >60%.


“…Proper expression rendering affects overall speech perception, which is important for applications such as audiobooks and newsreaders. In particular, emotional speech synthesis, which focuses on emotion expression rendering, has drawn much attention recently [11][12][13][14][15]. The emotional expressions are directly affected by the speaker's intentions, leading to speech with different emotion categories such as happy, angry, sad and fear.…”

Section: Introduction (mentioning)

confidence: 99%

Controllable Emotion Transfer For End-to-End Speech Synthesis

Yang, Xue et al. 2020. Preprint.

Emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text to speech (TTS) systems. However, the transferred emotion in the synthetic speech is not accurate and expressive enough with emotion category confusions. Moreover, it is hard to select an appropriate reference to deliver desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug two emotion classifiers (one after the reference encoder, one after the decoder output) to enhance the emotion-discriminative ability of the emotion embedding and the predicted mel-spectrum. Second, we adopt style loss to measure the difference between the generated and reference mel-spectrum. The emotion strength in the synthetic speech can be controlled by adjusting the value of the emotion embedding as the emotion embedding can be viewed as the feature map of the mel-spectrum. Experiments on emotion transfer and strength control have shown that the synthetic speech of the proposed method is more accurate and expressive with less emotion category confusions and the control of emotion strength is more salient to listeners.


“…As a part of the important information conveyed by human speech, emotional expressions are directly affected by the speaker's intentions that may lead to different emotions, e.g., fear, angry, happy, sad, surprise and disgust. Therefore, how to present appropriate emotions in synthetic speech is important in building diverse audio generation systems and immersive human-computer interaction systems [12], [13], [14], [15], [16], and thus has been drawn much attention recently [17], [18], [19], [20], [21], [22].…”

mentioning

confidence: 99%

“…In the same-speaker scenario, to synthesize emotional speech of a single speaker, a straightforward way is to train a TTS model with categorized emotional data [23] if sizable emotional data is available. Besides, there are also several other methods to achieve this goal, e.g., model adaptation on a base model using a small amount of emotional data [24], [25] and code/embedding-based methods [19], [26], [27]. However, the weakness of these same-speaker methods is obvious.…”

mentioning

confidence: 99%

Cross-speaker emotion disentangling and transfer for end-to-end speech synthesis

Li, Wang, Xie et al. 2021. Preprint.

The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis particularly aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During the emotion transfer process, the identity information of the source speaker could also affect the synthesized results, resulting in the issue of speaker leakage, i.e., synthetic speech may have the voice identity of the source speaker rather than the target speaker. This paper proposes a new method with the aim to synthesize controllable emotional expressive speech and meanwhile maintain the target speaker's identity in the cross-speaker emotion TTS task. The proposed method is a Tacotron2-based framework with the emotion embedding as the conditioning variable to provide emotion information. Two emotion disentangling modules are contained in our method to 1) get speaker-independent and emotion-discriminative embedding, and 2) explicitly constrain the emotion and speaker identity of synthetic speech to be that as expected. Moreover, we present an intuitive method to control the emotional strength in the synthetic speech for the target speaker. Specifically, the learned emotion embedding is adjusted with a flexible scalar value, which allows controlling the emotion strength conveyed by the embedding. Extensive experiments have been conducted on a Mandarin disjoint corpus, and the results demonstrate that the proposed method is able to synthesize reasonable emotional speech for the target speaker. Compared to the state-of-the-art reference embedding learned methods, our method gets the best performance on the cross-speaker emotion transfer task, indicating that our method achieves the new state-of-the-art performance on learning the speaker-independent emotion embedding. Furthermore, the strength ranking test and pitch trajectories plots demonstrate that the proposed method can effectively control the emotion strength, leading to prosody-diverse synthetic speech.
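The scalar strength control mentioned in this abstract can be pictured with a small sketch: a learned emotion embedding is multiplied by a scalar before it conditions the synthesizer. The function names, the toy vectors, and the use of simple concatenation as the conditioning step are all assumptions made for illustration; the abstract does not specify how the embedding is combined with the text encoding.

```python
# Minimal sketch: scale a learned emotion embedding by a scalar strength.
# Illustrative assumption only; names, values and the concatenation step
# are hypothetical and not taken from the cited paper.

def scale_emotion(embedding, strength):
    """Return the embedding multiplied by a scalar emotion strength."""
    return [strength * value for value in embedding]


def condition(text_encoding, emotion_embedding, strength=1.0):
    """Concatenate a text encoding with the strength-scaled emotion embedding."""
    return list(text_encoding) + scale_emotion(emotion_embedding, strength)


sad_embedding = [0.5, -1.0, 2.0]   # toy stand-in for a learned embedding
text_encoding = [0.25, 0.75]       # toy stand-in for an encoder output

weak = condition(text_encoding, sad_embedding, strength=0.5)
strong = condition(text_encoding, sad_embedding, strength=2.0)
```

Here a strength below 1 shrinks the emotion component and a strength above 1 enlarges it, while the text encoding is left unchanged.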

