2018
DOI: 10.1016/j.specom.2018.03.002

Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis

Abstract: In this paper, we investigate the simultaneous modeling of multiple emotions in DNN-based expressive speech synthesis, and how to represent the emotional labels, such as emotional class and strength, for this task. Our goal is to answer two questions: First, what is the best way to annotate speech data with multiple emotions: should we use the labels that the speaker intended to express, or labels based on listener perception of the resulting speech signals? Second, how should the emotional information be repre…

Cited by 81 publications (58 citation statements)
References 26 publications
“…It has been shown that only ∼5 min of speech per style is sufficient to produce speech of acceptable quality in a specific style. The use of input codes to represent different styles is also presented in [119, 120]. There have also been attempts at style transplantation, i.e., producing speech in the voice of speaker A in style X without having any sentence from speaker A in style X in the training data, in which case the network is forced to learn style X from other speakers in the training database [121, 122].…”
Section: Progress In Speech Recognition And Synthesis As Well As… (mentioning)
confidence: 99%
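The statement above describes conditioning a DNN acoustic model on style or emotion "input codes". Below is a minimal sketch of that idea in PyTorch; the emotion set, feature dimensions, and layer sizes are illustrative assumptions, not the configuration of the cited systems.

```python
# Sketch (assumed setup, not the cited systems' code): condition a feed-forward
# DNN acoustic model on an emotion/style code by concatenating a one-hot vector
# to the per-frame linguistic features.
import torch
import torch.nn as nn

NUM_EMOTIONS = 4      # e.g. neutral, happy, sad, angry (assumed)
LINGUISTIC_DIM = 300  # per-frame linguistic/context features (assumed)
ACOUSTIC_DIM = 187    # e.g. mel-cepstra + F0 + aperiodicity (assumed)

class EmotionConditionedDNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LINGUISTIC_DIM + NUM_EMOTIONS, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, ACOUSTIC_DIM),
        )

    def forward(self, linguistic, emotion_id):
        # emotion_id: (batch,) integer label -> one-hot code repeated per frame
        code = torch.nn.functional.one_hot(emotion_id, NUM_EMOTIONS).float()
        code = code.unsqueeze(1).expand(-1, linguistic.size(1), -1)
        return self.net(torch.cat([linguistic, code], dim=-1))

# usage: model(torch.randn(8, 100, LINGUISTIC_DIM),
#              torch.randint(0, NUM_EMOTIONS, (8,)))
```

Repeating the code on every frame lets a single network produce different acoustic trajectories for the same text depending on the requested style, which is what makes a small amount of per-style data usable.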
“…The proposed speaker representation learning algorithms extend these ideas to make DNNs learn the pairwise perceptual similarity between speakers rather than the conventional pointwise impression of a single speaker's voice. Furthermore, one can model the relationship between a speaker's intention and a listener's perception (e.g., differences in emotion perception [37]) by using these algorithms. Also, we can use the proposed speaker embeddings in more sophisticated speech synthesis frameworks, such as end-to-end multi-speaker TTS [21], multi-speaker multi-lingual TTS [38], and singing VC [39], instead of the conventional discriminative speaker embeddings.…”
Section: E. Discussion (mentioning)
confidence: 99%
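As a rough illustration of the idea quoted above (embeddings trained to match pairwise perceptual similarity rather than pointwise speaker identity), the sketch below fits speaker embeddings so that their cosine similarity approximates listener-rated similarity scores. The encoder architecture, feature dimensions, and the [0, 1] rating scale are assumptions for illustration only.

```python
# Sketch (assumed setup): learn speaker embeddings whose pairwise cosine
# similarity matches listener-rated perceptual similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, emb_dim))

    def forward(self, utt_feats):                 # (batch, frames, feat_dim)
        return self.proj(utt_feats.mean(dim=1))   # mean-pool frames -> embedding

def pairwise_similarity_loss(emb_a, emb_b, rated_similarity):
    # rated_similarity: (batch,) listening-test scores in [0, 1] (assumed scale)
    predicted = F.cosine_similarity(emb_a, emb_b, dim=-1)
    return F.mse_loss(predicted, rated_similarity)
```

Training on speaker pairs rather than single speakers is what lets the embedding space reflect how listeners perceive voices, instead of only separating speaker identities.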
“…Many pioneering methods have been proposed for emotional TTS. [4] proposes an LSTM-based acoustic model for emotional TTS, where several kinds of emotional category labels, such as one-hot vectors or perception vectors, are used as an extra input to the acoustic model. [5] uses an improved Tacotron [1] model for end-to-end emotional TTS, in which the emotion labels are concatenated to the output of both the decoder pre-net and the first decoder RNN layer.…”
Section: Introduction (mentioning)
confidence: 99%
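The conditioning scheme quoted from [5] (concatenating the emotion label to the decoder pre-net output) can be sketched roughly as follows; the layer sizes and the use of a single GRU cell are simplifying assumptions and do not reproduce the cited Tacotron variant.

```python
# Sketch (assumed setup): inject an emotion label into a Tacotron-style decoder
# by concatenating an emotion embedding to the pre-net output at each step.
import torch
import torch.nn as nn

class EmotionalDecoderStep(nn.Module):
    def __init__(self, mel_dim=80, prenet_dim=128, emo_emb_dim=16,
                 num_emotions=4, rnn_dim=256):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(mel_dim, prenet_dim), nn.ReLU())
        self.emotion_emb = nn.Embedding(num_emotions, emo_emb_dim)
        self.rnn = nn.GRUCell(prenet_dim + emo_emb_dim, rnn_dim)

    def forward(self, prev_mel, emotion_id, hidden):
        # prev_mel: (batch, mel_dim); emotion_id: (batch,); hidden: (batch, rnn_dim)
        x = torch.cat([self.prenet(prev_mel),
                       self.emotion_emb(emotion_id)], dim=-1)
        return self.rnn(x, hidden)   # new decoder hidden state
```

Feeding the emotion embedding at every decoder step, rather than only at the encoder, keeps the emotional conditioning available throughout autoregressive generation.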