Speech emotion recognition (SER) plays a significant role in human–machine interaction. Recognizing and precisely classifying emotion from speech is a challenging task because a machine cannot understand the context of an utterance. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotion classification from speech signals; however, they do not capture the emotional state of the speaker accurately enough. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from benchmark speech emotion datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Database of Emotional Speech (Emo-DB), Surrey Audio-Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Our proposed method achieves an accuracy of 95.10% on Emo-DB, 82.10% on SAVEE, 83.80% on IEMOCAP, and 81.30% on RAVDESS in the speaker-dependent SER experiments. Moreover, our method outperforms existing handcrafted-feature-based SER approaches in speaker-independent SER.
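The abstract does not specify the implementation details, but a minimal sketch of the described pipeline might look as follows. The choice of an ImageNet-pretrained VGG16 fed with three-channel spectrogram "images", the ranking of features by absolute Pearson correlation with the labels as a stand-in for correlation-based feature selection, and the helper names `extract_features` and `select_by_correlation` are all illustrative assumptions, not details taken from the paper.

```python
# Sketch: pretrained-CNN feature extraction + correlation-based feature
# selection + classical classifiers (assumptions noted in the lead-in).
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def extract_features(spectrograms):
    """Run 3-channel spectrogram tensors (N, 3, 224, 224) through a
    pretrained VGG16 and return penultimate-layer activations."""
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    vgg.classifier = vgg.classifier[:-1]   # drop the final 1000-way layer
    vgg.eval()
    with torch.no_grad():
        return vgg(spectrograms).numpy()   # shape: (N, 4096)

def select_by_correlation(X, y, k=500):
    """Keep the k features with the largest |Pearson r| against the labels
    (a simplified stand-in for correlation-based feature selection)."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.argsort(-np.abs(np.nan_to_num(r)))[:k]
    return X[:, keep], keep

# Hypothetical usage, given spectrogram tensors X_spec and labels y:
# X_sel, idx = select_by_correlation(extract_features(X_spec), y)
# for clf in (SVC(), RandomForestClassifier(),
#             KNeighborsClassifier(), MLPClassifier()):
#     print(type(clf).__name__, cross_val_score(clf, X_sel, y, cv=5).mean())
```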
Emotion recognition in conversation is a challenging task because it requires an understanding of both the contextual and the linguistic aspects of a conversation. Emotion recognition in speech has been well studied, but in two-party or multi-party conversations, emotions can be complex, mixed, and embedded in context. To tackle this challenge, we propose a method that combines the state-of-the-art RoBERTa (Robustly Optimized BERT Pretraining Approach) with a bidirectional long short-term memory (BiLSTM) network for contextualized emotion recognition. RoBERTa is a transformer-based language model that is an improved version of the well-known BERT. We use RoBERTa features as input to a BiLSTM model that learns to capture contextual dependencies and sequential patterns in the input text. The proposed model is trained and evaluated on the Multimodal EmotionLines Dataset (MELD) to recognize emotions in conversation. The textual modality of the dataset is used for the experimental evaluation, with the weighted average F1 score and accuracy as performance metrics. The experimental results indicate that combining a pretrained transformer-based language model with a BiLSTM network significantly enhances the recognition of emotions in contextualized conversational settings.
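A minimal sketch of such a RoBERTa + BiLSTM architecture is shown below. The `roberta-base` checkpoint, the single BiLSTM layer with a hidden size of 256, the mean-pooling step, and the 7-class output head (MELD annotates seven emotions) are our assumptions for illustration; the paper's exact configuration may differ.

```python
# Sketch: RoBERTa token embeddings feeding a BiLSTM emotion classifier.
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class RobertaBiLSTM(nn.Module):
    def __init__(self, num_classes=7, hidden=256):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-base")
        self.bilstm = nn.LSTM(input_size=768, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # Contextual token embeddings from the pretrained transformer.
        emb = self.roberta(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        out, _ = self.bilstm(emb)              # (B, T, 2 * hidden)
        # Mean-pool over non-padding tokens, then classify.
        mask = attention_mask.unsqueeze(-1)
        pooled = (out * mask).sum(1) / mask.sum(1)
        return self.classifier(pooled)

# Hypothetical usage on a single utterance:
tok = RobertaTokenizer.from_pretrained("roberta-base")
batch = tok(["I can't believe you did that!"], return_tensors="pt",
            padding=True, truncation=True)
logits = RobertaBiLSTM()(batch["input_ids"], batch["attention_mask"])
```

The BiLSTM reads the transformer's token representations in both directions, which is one way to model the sequential and contextual dependencies the abstract describes before a single utterance-level prediction is made.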