2021
DOI: 10.48550/arxiv.2102.01754
Preprint

LSSED: a large-scale dataset and benchmark for speech emotion recognition

Abstract: Speech emotion recognition is a vital contributor to the next generation of human-computer interaction (HCI). However, currently existing small-scale databases have limited the development of related research. In this paper, we present LSSED, a challenging large-scale English speech emotion dataset, which has data collected from 820 subjects to simulate real-world distribution. In addition, we release some pre-trained models based on LSSED, which can not only promote the development of speech emotion recognition,…

Cited by 3 publications (3 citation statements) · References 25 publications
“…For speech synthesis, various datasets are available for different tasks. LSSED [11] is a challenging large-scale English speech emotion dataset, which has data collected from 820 subjects to simulate real-world distribution. AISHELL-3 [36] contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese speakers, which could be applied for multi-speaker speech synthesis.…”
Section: Speech
Confidence: 99%
“…ESD [21] is the first parallel multi-lingual and multi-speaker emotional speech dataset designed for voice conversion tasks and contains five emotional classes in each language. LSSED [22] is a large English dataset designed for SER tasks, which has a nine-class emotion annotation. EmoV-DB [23] is the first dataset designed for emotional TTS tasks.…”
Section: Introduction
Confidence: 99%
“…They pretrained their novel DNN-based models with 90,000 unlabeled utterances, and fine-tuned and evaluated them on 3,000 randomly selected, manually annotated utterances from the same dataset. Fan et al. [14] presented an SER dataset with a total duration of over 200 hours. They proposed a novel SER model containing pyramid convolutions, which outperformed other models that were tested on the dataset.…”
Section: Introduction
Confidence: 99%