ShEMO: a large-scale validated database for Persian speech emotion detection

Nezami, Omid Mohamad; Lou, Paria Jamshid; Karami, Mansoureh

doi:10.1007/s10579-018-9427-x

Cited by 53 publications

(9 citation statements)

References 43 publications


“…Samples of aggressive English speech (n = 39) were taken from a corpus of recordings of British drama students, who were instructed to imagine that they were about to attack someone in a fight and to yell ‘That's enough, I'm coming for you!’ [25]. Recordings of aggressive Persian speech (n = 43) were obtained from ShEMO, an open corpus of emotional speech compiled from radio plays [28]. In contrast with the lexically identical English recordings, the Persian utterances were taken from different contexts and were not repetitions of the same phrase.…”

Section: Methods (mentioning)

confidence: 99%

Static and dynamic formant scaling conveys body size and aggression

2022


When producing intimidating aggressive vocalizations, humans and other animals often extend their vocal tracts to lower their voice resonance frequencies (formants) and thus sound big. Is acoustic size exaggeration more effective when the vocal tract is extended before, or during, the vocalization, and how do listeners interpret within-call changes in apparent vocal tract length? We compared perceptual effects of static and dynamic formant scaling in aggressive human speech and nonverbal vocalizations. Acoustic manipulations corresponded to elongating or shortening the vocal tract either around (Experiment 1) or from (Experiment 2) its resting position. Gradual formant scaling that preserved average frequencies conveyed the impression of smaller size and greater aggression, regardless of the direction of change. Vocal tract shortening from the original length conveyed smaller size and less aggression, whereas vocal tract elongation conveyed larger size and more aggression, and these effects were stronger for static than for dynamic scaling. Listeners familiarized with the speaker's natural voice were less often ‘fooled’ by formant manipulations when judging speaker size, but paid more attention to formants when judging aggressive intent. Thus, within-call vocal tract scaling conveys emotion, but a better way to sound large and intimidating is to keep the vocal tract consistently extended.


“…In this work, we used four popular English datasets (TESS [13], RAVEDESS [14], SAVEE [15], IEMOCAP [16]) and one German dataset (EMODB [17]) as source for pretraining. We selected three low-resource language datasets for adaption - Italian (EMOVO [18]), Persian (SHEMO [19]), and Urdu (URDU [20]). Table 1 lists down the corpus statistics for source and target datasets.…”

Section: Studied Languages and Data (mentioning)

confidence: 99%

Meta-Learning for Low-Resource Speech Emotion Recognition

Chopra, Mathur, Sawhney, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


While emotion recognition is a well-studied task, it remains unexplored to a large extent in cross-lingual settings. Speech Emotion Recognition (SER) in low-resource languages poses difficulties as existing approaches for knowledge transfer do not generalize seamlessly. Probing the learning process of generalized representations across languages, we propose a meta-learning approach for low-resource speech emotion recognition. The proposed approach achieves fast adaptation on a number of unseen target languages simultaneously. We evaluate the Model Agnostic Meta-Learning (MAML) algorithm on three low-resource target languages: Persian, Italian, and Urdu. We empirically demonstrate that our proposed method, MetaSER, considerably outperforms multitask and transfer learning-based methods for speech emotion recognition task, and discuss the benefits, efficiency, and challenges of MetaSER on limited data settings.


“…6) ShEMO: Sharif Emotional Speech Database [27] is a Persian emotional speech dataset that contains 3000 seminatural utterances extracted from online radio plays and labeled considering the emotions anger, fear, happiness, sadness, surprise, and neutral state, by a group of 12 annotators of both sexes.…”

Section: A Speech Databases (mentioning)

confidence: 99%

Language-agnostic speech anger identification

Saitta, Ntalampiras 2021

2021 44th International Conference on Telecommunications and Signal Processing (TSP)


Following the constantly increasing adoption of affective computing based solutions, this paper investigates the feasibility of multilingual anger identification. To this end, we formed such a corpus by suitably combining seven different datasets representing five different languages, i.e. English, German, Italian, Urdu, and Persian. After analyzing the diverse characteristics of the datasets, we designed four classification algorithms, namely Support Vector Machine, Decision Tree-based Bagging scheme, Convolutional Neural Network, and Convolutional Recurrent Neural Network. Such classification mechanisms are trained on appropriate features extracted from time and/or frequency domains, while speech data have been balanced considering every diverse characteristic incorporated in the datasets (language, sex, acted, etc.). Our findings render multilingual anger identification feasible since the proposed audio pattern recognition methodology based on Mel-spectrograms and CRNN achieved quite satisfactory identification rates.
