Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement

Triantafyllopoulos, Andreas; Keren, Gil; Wagner, Johannes; Steiner, Ingmar; Schuller, Björn

doi:10.21437/interspeech.2019-1811

Cited by 48 publications

(34 citation statements)

References 24 publications


“…The authors in [45] used 1D and 2D CNN-LSTM networks to identify speech emotions. The authors in [40] analyzed the effect noise removal techniques have on SER systems. The authors in [11] performed transfer learning and multi-task learning experiments and found that traditional machine learning models may function as well as deep learning models [2,41] for speech emotion recognition given the researchers choose the right input feature.…”

Section: Related Work (mentioning)

confidence: 99%

Cross corpus multi-lingual speech emotion recognition using ensemble learning

Zehra, Javed, Jalil, et al. 2021

Complex Intell. Syst.

103


Receiving an accurate emotional response from robots has been a challenging task for researchers for the past few years. With the advancements in technology, robots like service robots interact with users of different cultural and lingual backgrounds. The traditional approach towards speech emotion recognition cannot be utilized to enable the robot and give an efficient and emotional response. The conventional approach towards speech emotion recognition uses the same corpus for both training and testing of classifiers to detect accurate emotions, but this approach cannot be generalized for multi-lingual environments, which is a requirement for robots used by people all across the globe. In this paper, a series of experiments are conducted to highlight an ensemble learning effect using a majority voting technique for cross-corpus, multi-lingual speech emotion recognition system. A comparison of the performance of an ensemble learning approach against traditional machine learning algorithms is performed. This study tests a classifier’s performance trained on one corpus with data from another corpus to evaluate its efficiency for multi-lingual emotion detection. According to experimental analysis, different classifiers give the highest accuracy for different corpora. Using an ensemble learning approach gives the benefit of combining all classifiers’ effect instead of choosing one classifier and compromising certain language corpus’s accuracy. Experiments show an increased accuracy of 13% for Urdu corpus, 8% for German corpus, 11% for Italian corpus, and 5% for English corpus from with-in corpus testing. For cross-corpus experiments, an improvement of 2% when training on Urdu data and testing on German data and 15% when training on Urdu data and testing on Italian data is achieved. An increase of 7% in accuracy is obtained when testing on Urdu data and training on German data, 3% when testing on Urdu data and training on Italian data, and 5% when testing on Urdu data and training on English data. Experiments prove that the ensemble learning approach gives promising results against other state-of-the-art techniques.


“…Speech emotion recognition is considered a challenging task in the HCI domain. A large number of methodologies and corpora have been proposed in previous works [10][11][12]. The early stage of SER research used handcrafted speech features and low-level descriptors to train classic machine learning models.…”

Section: Related Work (mentioning)

confidence: 99%

Autoencoder With Emotion Embedding for Speech Emotion Recognition

Zhang, Xue 2021

IEEE Access


An important part of the human-computer interaction process is speech emotion recognition (SER), which has been receiving more attention in recent years. However, although a wide diversity of methods has been proposed in SER, these approaches still cannot improve the performance. A key issue in the low performance of the SER system is how to effectively extract emotion-oriented features. In this paper, we propose a novel algorithm, an autoencoder with emotion embedding, to extract deep emotion features. Unlike many previous works, instance normalization, which is a common technique in the style transfer field, is introduced into our model rather than batch normalization. Furthermore, the emotion embedding path in our method can lead the autoencoder to efficiently learn a priori knowledge from the label. It can enable the model to distinguish which features are most related to human emotion. We concatenate the latent representation learned by the autoencoder and acoustic features obtained by the openSMILE toolkit. Finally, the concatenated feature vector is utilized for emotion classification. To improve the generalization of our method, a simple data augmentation approach is applied. Two publicly available and highly popular databases, IEMOCAP and EMODB, are chosen to evaluate our method. Experimental results demonstrate that the proposed model achieves significant performance improvement compared to other speech emotion recognition systems.


“…Despite the significant progress in Speech Emotion Recognition (SER) through Deep Neural Networks (DNNs), SER systems still perform poorly in noisy environments [1,2], and when the imperceptible adversarial perturbation is added to test examples [3]. The performance of state-of-the-art SER also degrades in the cross-corpus setting when an acoustic mismatch between training and testing exists [4].…”

Section: Introduction (mentioning)

confidence: 99%

“…This shows that SER systems lack robustness and generalisation which makes them susceptible to unknown test data shifts. Researchers have developed various methods to improve the performance of SER in noisy environment [2,5] and cross-corpus setting [6], however, significant performance improvement is still required.…”

Section: Introduction (mentioning)

confidence: 99%


Deep Architecture Enhancing Robustness to Noise, Adversarial Attacks, and Cross-Corpus Setting for Speech Emotion Recognition

Latif, Rana, Khalifa, et al. 2020

Interspeech 2020

Self Cite


Speech emotion recognition systems (SER) can achieve high accuracy when the training and test data are identically distributed, but this assumption is frequently violated in practice and the performance of SER systems plummet against unforeseen data shifts. The design of robust models for accurate SER is challenging, which limits its use in practical applications. In this paper we propose a deeper neural network architecture wherein we fuse Dense Convolutional Network (DenseNet), Long short-term memory (LSTM) and Highway Network to learn powerful discriminative features which are robust to noise. We also propose data augmentation with our network architecture to further improve the robustness. We comprehensively evaluate the architecture coupled with data augmentation against (1) noise, (2) adversarial attacks and (3) cross-corpus settings. Our evaluations on the widely used IEMOCAP and MSP-IMPROV datasets show promising results when compared with existing studies and state-of-the-art models.
