2021
DOI: 10.48550/arxiv.2106.09317
Preprint

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

Abstract: Recently, there has been an increasing interest in neural speech synthesis. While the deep neural network achieves the state-of-the-art result in text-to-speech (TTS) tasks, how to generate a more emotional and more expressive speech is becoming a new challenge to researchers due to the scarcity of high-quality emotion speech dataset and the lack of advanced emotional TTS model. In this paper, we first briefly introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio fi…

Cited by 3 publications (3 citation statements)
References 44 publications
“…Environmental sound categorization research is scarce compared to other machine learning sound- and image-processing challenges. Recent works have demonstrated this concept's utility in a variety of contexts, such as virtual assistants [22], automatic voice recognition [23], and text-to-speech applications [24]. The classifier employed in these investigations divides these recent works into two kinds.…”
Section: Related Work (mentioning)
confidence: 99%
“…Training TTS and SVS systems both require a significant amount of annotated data [9,10,15]. The rapid increase in the amount of multimedia content on the Internet in recent years makes data much more important.…”
Section: Dataset (mentioning)
confidence: 99%
“…4) FG-TransformerTTS (Chen & Rudnicky, 2021): The fine-grained style control on auto-regressive model Transformer-TTS. 5) Expressive FastSpeech 2 (Ren et al., 2020): The combination of both multi-speaker (Chen et al., 2020b) and multi-emotion (Cui et al., 2021) FastSpeech 2, which adds the speaker and emotion d-vectors extracted by the pretrained discriminative models to the backbone. 6) Meta-StyleSpeech (Min et al., 2021): The fine-tuned multi-speaker text-to-speech model with meta-learning.…”
mentioning
confidence: 99%