Speech Emotion Recognition (SER) is a challenging task that researchers have worked on for decades. Recently, Deep Learning (DL) based approaches have been shown to perform well in SER tasks; however, their superior performance has been found to be limited to the distribution of the data used to train the model. In this paper, we present an analysis of using autoencoders to improve the generalisability of DL based SER solutions. We train a sparse autoencoder on a large speech corpus extracted from social media. The encoder part of the trained autoencoder is then reused as the input to a long short-term memory (LSTM) network, and the encoder-LSTM model is re-trained on an aggregation of five commonly used speech emotion corpora. Our evaluation uses a corpus that is unseen during the training and validation stages to simulate 'in the wild' conditions and to analyse the generalisability of our solution. We compare the performance of the encoder based model with that of a model trained without an encoder. Our results show that the autoencoder based model improves the unweighted accuracy on the unseen corpus by 8%, indicating that autoencoder based pre-training can improve the generalisability of DL based SER solutions.
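The evaluation described in this abstract rests on two ideas: holding one corpus out of training entirely, and scoring with unweighted accuracy. The following is a minimal sketch of both, not the authors' code. It assumes that unweighted accuracy means the mean of per-class recalls, as is common in SER work, and the function names and the `(corpus, features, label)` tuple layout are hypothetical.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally.

    Assumption: this is the usual SER definition of unweighted accuracy;
    the abstract itself does not spell out the formula.
    """
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def leave_one_corpus_out(samples, held_out):
    """Split hypothetical (corpus, features, label) tuples by corpus name.

    The held-out corpus never appears in the training split, which mimics
    the 'unseen corpus' setting described in the abstract.
    """
    train = [s for s in samples if s[0] != held_out]
    test = [s for s in samples if s[0] == held_out]
    return train, test
```

With imbalanced classes, plain accuracy and unweighted accuracy can differ sharply: if three of four test samples are "angry" and only those three are classified correctly, plain accuracy is 0.75 while unweighted accuracy is 0.5.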
Extracting emotions from physiological signals has become popular over the past decade. Recent advancements in wearable smart devices have enabled physiological signals to be captured continuously and unobtrusively. However, signal readings from smart wearables are lossy due to user activities, making it difficult to develop robust models for emotion recognition. The limited availability of data labels is a further, inherent challenge in developing machine learning techniques for emotion classification. This paper presents a novel self-supervised approach, inspired by contrastive learning, to address these challenges. In particular, the proposed approach learns representations of individual physiological signals, which can then be used for downstream classification tasks. Our evaluation on four publicly available datasets shows that the proposed method outperforms state-of-the-art emotion classification techniques. In addition, we show that our method is more robust to losses in the input signal.
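The abstract does not specify its training objective beyond "inspired by contrastive learning". As a generic illustration of that family only, the sketch below computes an InfoNCE-style loss for one anchor embedding. The function names, the cosine similarity, and the temperature value are assumptions for illustration, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor (generic sketch).

    The loss is small when the anchor is closer to its positive view
    than to any of the negatives, and large otherwise.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

In a self-supervised setting, the positive would typically be an augmented or partially masked view of the same signal window, and the negatives would be windows drawn from other recordings; how the paper constructs these pairs is not stated here.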