As human-computer interaction (HCI) grows in importance and deep learning continues to evolve, numerous models have been applied to Speech Emotion Recognition (SER), yielding significant advances in recent years. Nevertheless, recognizing and processing human emotions computationally remains a complex and formidable challenge. This review provides a comprehensive summary of the latest accomplishments in SER across a diverse range of application scenarios, from education and healthcare to criminal investigation. It also examines models and preprocessing techniques, including Convolutional Neural Networks (CNNs), Convolutional Recurrent Neural Networks (CRNNs), and Long Short-Term Memory (LSTM) networks, as well as datasets such as RAVDESS and RECOLA, which span a wide array of scenes and languages. Although recent work in SER has achieved impressive accuracy, a notable gap remains in research addressing more intricate emotional contexts, such as irony and sarcasm. Accordingly, this review analyzes the limitations inherent in different feature engineering strategies, the interpretability challenges posed by complex models, the constraints imposed by homogeneous and hard-to-gather datasets, and the expansive scope of potential applications for SER. In light of these complexities, a pathway to further enhance SER's effectiveness and applicability is proposed: exploring non-binary emotion classification, harnessing rich contextual information, and integrating datasets that incorporate gestural and textual data. By adapting feature extraction techniques to the unique demands of specific scenarios, the performance of SER models could be markedly improved.