2020
DOI: 10.48550/arxiv.2010.13350
Preprint
Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition

Abstract: Neural text-to-speech (TTS) approaches generally require a large amount of high-quality speech data, which makes it difficult to obtain such a dataset with extra emotion labels. In this paper, we propose a novel approach for emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, our proposed method consists of a cross-domain speech emotion recognition (SER) model and an emotional TTS model. Firstly, we train the cross-domain SER model on both SER and TTS datasets. Then, we use emotion la…
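The abstract describes a two-stage method: a cross-domain SER model assigns emotion labels to an otherwise unlabeled TTS corpus, and an emotional TTS model is then trained on those labels. The following is a minimal sketch of that pseudo-labeling stage, assuming a simple mel-spectrogram SER classifier; the model class, feature sizes, and emotion set are illustrative placeholders, not the authors' architecture.

# Hedged sketch of stage 1: tag an unlabeled TTS corpus with pseudo emotion
# labels produced by a (separately trained) cross-domain SER model.
import torch
import torch.nn as nn

NUM_EMOTIONS = 4        # e.g. neutral / happy / sad / angry (assumed)
MEL_DIM = 80            # mel-spectrogram feature dimension (assumed)

class CrossDomainSER(nn.Module):
    """Toy stand-in for the cross-domain SER model (placeholder)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(MEL_DIM, 128, batch_first=True)
        self.classifier = nn.Linear(128, NUM_EMOTIONS)

    def forward(self, mels):                  # mels: (B, T, MEL_DIM)
        _, h = self.encoder(mels)             # h: (1, B, 128)
        return self.classifier(h.squeeze(0))  # logits: (B, NUM_EMOTIONS)

def pseudo_label_tts_corpus(ser, tts_mels):
    """Assign each unlabeled TTS utterance an emotion id."""
    ser.eval()
    with torch.no_grad():
        logits = ser(tts_mels)
    return logits.argmax(dim=-1)              # (B,) pseudo emotion labels

if __name__ == "__main__":
    ser = CrossDomainSER()                    # assume already trained on SER + TTS data
    fake_tts_batch = torch.randn(8, 120, MEL_DIM)
    print(pseudo_label_tts_corpus(ser, fake_tts_batch))

Stage 2 (not shown) would condition the emotional TTS model on these pseudo labels, for example via an emotion embedding table looked up per utterance.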

Cited by 2 publications (2 citation statements)
References 22 publications (8 reference statements)
“…Generally, with a well-trained neural acoustic model [2,5,6,7] and a neural vocoder [8,9,10,11], or alternatively with fully end-to-end models [12,13,14] which construct waveform signals directly from text input, it is possible to synthesize high-quality neutral speech. Recently, much attention has been drawn to synthesizing expressive speech, such as stylized speech [15,16], emotional speech [17,18,19,20,21,22], and also singing voice [23,24].…”
Section: Introduction (mentioning)
confidence: 99%
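The quoted statement refers to the standard two-module neutral-TTS pipeline: an acoustic model predicts a mel-spectrogram from text, and a neural vocoder converts the mel-spectrogram into a waveform. A minimal sketch of that data flow follows; both modules are toy placeholders, not any specific published architecture.

# Hedged sketch: text -> acoustic model -> mel-spectrogram -> vocoder -> waveform.
import torch
import torch.nn as nn

MEL_DIM, VOCAB = 80, 100   # assumed feature size and phoneme-vocabulary size

class ToyAcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 128)
        self.rnn = nn.GRU(128, 128, batch_first=True)
        self.to_mel = nn.Linear(128, MEL_DIM)

    def forward(self, phoneme_ids):           # (B, T_text)
        x, _ = self.rnn(self.emb(phoneme_ids))
        return self.to_mel(x)                  # (B, T_text, MEL_DIM)

class ToyVocoder(nn.Module):
    def __init__(self, hop=256):
        super().__init__()
        self.upsample = nn.Linear(MEL_DIM, hop)  # hop samples per mel frame

    def forward(self, mel):                    # (B, T, MEL_DIM)
        return torch.tanh(self.upsample(mel)).flatten(1)  # (B, T*hop) waveform

text = torch.randint(0, VOCAB, (1, 20))       # fake phoneme sequence
wav = ToyVocoder()(ToyAcousticModel()(text))  # text -> mel -> waveform
print(wav.shape)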
“…Wu's work [31] uses the weights of the attention layer in the GST as the probability of each emotion label to compute a cross-entropy loss and generate the emotion token. Some studies [32] use a speech emotion recognition module as an alternative approach to modeling the emotion features of existing audio. Several other studies [33,34] use VAE or DurIAN to model the emotional information and have made some progress.…”
Section: Introduction (mentioning)
confidence: 99%
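The statement above describes supervising GST attention weights as emotion-class probabilities with a cross-entropy loss. The sketch below illustrates that idea under the assumptions of one learnable style token per emotion class and a single attention head; all names and dimensions are illustrative, not taken from the cited paper.

# Hedged sketch: treat GST attention scores over style tokens as emotion-class
# logits and supervise them with cross-entropy against emotion labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 4
REF_DIM = 128          # reference-encoder output size (assumed)
TOKEN_DIM = 128        # style-token embedding size (assumed)

class GSTEmotionLayer(nn.Module):
    def __init__(self):
        super().__init__()
        # one learnable style token per emotion class (assumption)
        self.tokens = nn.Parameter(torch.randn(NUM_EMOTIONS, TOKEN_DIM))
        self.query_proj = nn.Linear(REF_DIM, TOKEN_DIM)

    def forward(self, ref_embedding):                     # (B, REF_DIM)
        q = self.query_proj(ref_embedding)                # (B, TOKEN_DIM)
        scores = q @ self.tokens.t() / TOKEN_DIM ** 0.5   # (B, NUM_EMOTIONS)
        weights = F.softmax(scores, dim=-1)               # attention weights
        style_emb = weights @ self.tokens                  # weighted "emotion token"
        return style_emb, scores

layer = GSTEmotionLayer()
ref = torch.randn(8, REF_DIM)                 # reference-encoder outputs (fake)
labels = torch.randint(0, NUM_EMOTIONS, (8,))
style_emb, scores = layer(ref)
loss = F.cross_entropy(scores, labels)        # supervise the attention weights
loss.backward()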