Speech Emotion Recognition (SER) makes it possible for machines to perceive affective information. Our previous research differed from conventional SER endeavours in that it focused on autonomously recognising unseen emotions in speech through machine learning. Such a step would enable the automatic learning of unknown, emerging emotional states. This type of learning framework, however, still relied on manual annotations to obtain multiple samples of each emotion. To reduce this additional workload, we herein propose a zero-shot SER framework that employs a per-emotion semantic-embedding paradigm to describe emotions, instead of sample-wise descriptors. Aiming to optimise the relationships between emotions, prototypes, and speech samples, the framework includes two types of learning strategy: sample-wise learning and emotion-wise learning. These strategies apply a novel learning process to speech samples and emotions, respectively, via specifically designed semantic-embedding prototypes. We verify the utility of these approaches through an extensive experimental evaluation on two corpora, examining three aspects: the influence of the two learning strategies, emotional-pair comparisons, and the selection of semantic-embedding prototypes and paralinguistic features. The experimental results indicate that semantic-embedding prototypes are applicable to zero-shot emotion recognition in speech, although performance depends on the choice of learning strategy and prototype.
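
To make the prototype-based paradigm concrete, the sketch below illustrates the zero-shot inference step: a speech sample, already projected into the semantic space, is assigned the unseen emotion whose prototype it is most compatible with. The function names, the 300-dimensional prototypes, and the cosine-similarity compatibility score are illustrative assumptions for this sketch, not the exact formulation used in the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compatibility score between two vectors in the semantic space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def predict_unseen_emotion(speech_embedding: np.ndarray,
                           prototypes: dict) -> str:
    """Assign the unseen emotion whose semantic-embedding prototype
    best matches the projected speech sample."""
    scores = {emotion: cosine_similarity(speech_embedding, proto)
              for emotion, proto in prototypes.items()}
    return max(scores, key=scores.get)

# Hypothetical example: per-emotion prototype vectors for emotions unseen
# at training time, and a speech sample projected (e.g. by a learned
# mapping from paralinguistic features) into the same semantic space.
rng = np.random.default_rng(0)
prototypes = {e: rng.normal(size=300)
              for e in ("surprise", "contempt", "pride")}
speech_embedding = prototypes["pride"] + 0.1 * rng.normal(size=300)
print(predict_unseen_emotion(speech_embedding, prototypes))  # -> "pride"
```

In such a setup, the sample-wise and emotion-wise strategies would differ in how the projection and prototypes are optimised during training (per speech sample versus per emotion class), while the inference step above remains the same.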