Assessing the effectiveness of ensembles in Speech Emotion Recognition: Performance analysis under challenging scenarios

López-Gil, Juan-Miguel; Garay-Vitoria, Nestor

doi:10.1016/j.eswa.2023.122905

Cited by 3 publications

(1 citation statement)

References 71 publications


“…In addition, different data sample augmentation methods, such as random mixing, adversarial training, transfer learning, and curriculum learning, could be applied to improve the training sample number and to enrich the data diversity [41]. Last but not least, ensembles of existing SER models would be beneficial to further enhance performance, although how to properly combine these models remains challenging [74].…”

Section: Discussion (mentioning)

confidence: 99%

Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion

Yu, Meng, Fan et al. 2024

Electronics

Speech emotion recognition (SER) aims to recognize human emotions through in-depth analysis of audio signals. However, it remains challenging to encode emotional cues and to fuse the encoded cues effectively. In this study, a dual-stream representation is developed, and both full training and fine-tuning of different deep networks are employed for encoding emotion patterns. Specifically, a cross-attention fusion (CAF) module is designed to integrate the dual-stream output for emotion recognition. Using different dual-stream encoders (fully training a text processing network and fine-tuning a pre-trained large language network), the CAF module is compared to three other fusion modules on three databases. The SER performance is quantified with weighted accuracy (WA), unweighted accuracy (UA), and F1-score (F1S). The experimental results suggest that the CAF outperforms the other three modules and leads to promising performance on the databases (EmoDB: WA, 97.20%; UA, 97.21%; F1S, 0.8804; IEMOCAP: WA, 69.65%; UA, 70.88%; F1S, 0.7084; RAVDESS: WA, 81.86%; UA, 82.75.21%; F1S, 0.8284). It is also found that fine-tuning a pre-trained large language network yields a better representation than fully training a text processing network. In a future study, improved SER performance could be achieved through the development of a multi-stream representation of emotional cues and the incorporation of a multi-branch fusion mechanism for emotion recognition.
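The abstract reports WA and UA without defining them. A minimal sketch follows, assuming the convention common in SER work in which WA is the overall accuracy and UA is the mean of the per-class recalls; the function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    total = defaultdict(int)  # samples per true class
    hits = defaultdict(int)   # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)


# Hypothetical toy labels (assumption for illustration, not data from the paper).
y_true = ["angry", "angry", "angry", "angry", "happy", "sad"]
y_pred = ["angry", "angry", "angry", "happy", "happy", "happy"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

Under this assumed definition, the two scores diverge when classes are imbalanced, which may be why both are reported for each database.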

Data augmentation using a 1D-CNN model with MFCC/MFMC features for speech emotion recognition

Mary Little Flower, Jaya, Christopher Ezhil Singh 2024

Automatika

E‐Speech: Development of a Dataset for Speech Emotion Recognition and Analysis

Liu, Shi, Zhang et al. 2024

International Journal of Intelligent Systems

Speech emotion recognition plays a crucial role in analyzing psychological disorders, behavioral decision‐making, and human‐machine interaction applications. However, the majority of current methods for speech emotion recognition rely heavily on data‐driven approaches, and the scarcity of emotional speech datasets limits progress in the research and development of emotion analysis and recognition. To address this issue, this study introduces a new English speech dataset specifically designed for emotion analysis and recognition. This dataset consists of 5503 voices from over 60 English speakers in different emotional states. Furthermore, to enhance emotion analysis and recognition, fast Fourier transform (FFT), short‐time Fourier transform (STFT), mel‐frequency cepstral coefficients (MFCCs), and continuous wavelet transform (CWT) are employed for feature extraction from the speech data. Using these algorithms, spectrum images of the speech samples are obtained, forming four datasets consisting of different speech feature images. To evaluate the dataset, 16 classification models and 19 detection algorithms are selected. The experimental results demonstrate that the majority of classification and detection models achieve exceptionally high recognition accuracy on this dataset, confirming its effectiveness and utility. The dataset proves to be valuable in advancing research and development in the field of emotion recognition.
