Spoken emotion recognition via locality-constrained kernel sparse representation

Zhao, Xiaoming; Zhang, Shiqing

doi:10.1007/s00521-014-1755-1

Cited by 11 publications

(3 citation statements)

References 55 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Other classifiers are GMM [9], HMM [10] and SVM [11], which are widely adopted for SER system. Furthermore, some advanced sparse representation-based classifiers [12], [13] have been published. Nevertheless, each classifier has its own advantages and disadvantages.…”

Section: Related Workmentioning

confidence: 99%

Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network

et al. 2020

View full text Add to dashboard Cite

Speech emotion recognition is a challenging but important task in human computer interaction (HCI). As technology and understanding of emotion are progressing, it is necessary to design robust and reliable emotion recognition systems that are suitable for real-world applications both to enhance analytical abilities supporting human decision making and to design human-machine interfaces (HMI) that assist efficient communication. This paper presents a multimodal approach for speech emotion recognition based on Multi-Level Multi-Head Fusion Attention mechanism and recurrent neural network (RNN). The proposed structure has inputs of two modalities: audio and text. For audio features, we determine the mel-frequency cepstrum (MFCC) from raw signals using the OpenSMILE toolbox. Further, we use pre-trained model of bidirectional encoder representations from transformers (BERT) for embedding text information. These features are fed parallelly into the self-attention mechanism base RNNs to exploit the context for each timestamp, then we fuse all representatives using multi-head attention technique to predict emotional states. Our experimental results on the three databases: Interactive Emotional Motion Capture (IEMOCAP), Multimodal EmotionLines Dataset (MELD), and CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), reveal that the combination of the two modalities achieves better performance than using single models. Quantitative and qualitative evaluations on all introduced datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods. INDEX TERMS Speech emotion recognition, multi-level multi-head fusion attention, RNN, audio features, textual features.

show abstract

Section: Related Workmentioning

confidence: 99%

Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network

et al. 2020

View full text Add to dashboard Cite

show abstract

“…As for emotion classifier, various traditional machine learning methods can be utilized for cross-corpus SER. The representative emotion classification methods contain linear discriminant classifier (LDC) (Banse and Scherer, 1996 ; Dellaert et al, 1996 ), K-Nearest Neighbor (Dellaert et al, 1996 ), artificial neural network (ANN) (Nicholson et al, 2000 ), support vector machines (SVM) (Kwon et al, 2003 ), hidden Markov models (HMM) (Nwe et al, 2003 ), Gaussian mixture models (GMM) (Ververidis and Kotropoulos, 2005 ), sparse representation classification (SRC) (Zhao and Zhang, 2015 ) and so on. Nevertheless, each classifier has its own advantages and disadvantages.…”

Section: Introductionmentioning

confidence: 99%

Deep Cross-Corpus Speech Emotion Recognition: Recent Advances and Perspectives

et al. 2021

Self Cite

View full text Add to dashboard Cite

Automatic speech emotion recognition (SER) is a challenging component of human-computer interaction (HCI). Existing literatures mainly focus on evaluating the SER performance by means of training and testing on a single corpus with a single language setting. However, in many practical applications, there are great differences between the training corpus and testing corpus. Due to the diversity of different speech emotional corpus or languages, most previous SER methods do not perform well when applied in real-world cross-corpus or cross-language scenarios. Inspired by the powerful feature learning ability of recently-emerged deep learning techniques, various advanced deep learning models have increasingly been adopted for cross-corpus SER. This paper aims to provide an up-to-date and comprehensive survey of cross-corpus SER, especially for various deep learning techniques associated with supervised, unsupervised and semi-supervised learning in this area. In addition, this paper also highlights different challenges and opportunities on cross-corpus SER tasks, and points out its future trends.

show abstract

“…Generative adversarial network (GAN) is a prevalent generative model [8], [12], [21]. Deep convolutional generative adversarial network based on traditional generative adversarial networks, introduces convolutional neural networks (CNN) into the training for unsupervised learning to improve the effect of generative networks.…”

Section: Introductionmentioning

confidence: 99%

Emotion Interaction Recognition Based on Deep Adversarial Network in Interactive Design for Intelligent Robot

Chen

et al. 2019

IEEE Access

View full text Add to dashboard Cite

Augmented Reality devices (AR), virtual reality devices (VR), are changing our lives and it is critical to provide intelligent interaction and improve the user's intelligent interactive experience. When artificial intelligence is introduced into Intelligent interaction emotion classification or semantic segmentation and other tasks, it requires professional knowledge to manually label images sample. To address the problem of scarcity of labeled data in emotion classification, an improved classification method based on semisupervised generative adversarial networks (GAN) is proposed in this paper. Firstly, the output layer of the traditional unsupervised GAN is replaced with Softmax layer to obtain the semi-supervised GAN. Secondly, additional labels are defined for generated samples to guiding the training process. Finally, we employ a semi-supervised training strategy to optimize the parameters of GAN and use the trained network to process videos. Experiments on existing public datasets show that our method has a certain improvement in compared with the classic methods based on deep learning, and has a higher recognition efficiency, which is more suitable for dimension emotion recognition of large-scale data.

show abstract

Spoken emotion recognition via locality-constrained kernel sparse representation

Cited by 11 publications

References 55 publications

Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network

Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network

Deep Cross-Corpus Speech Emotion Recognition: Recent Advances and Perspectives

Emotion Interaction Recognition Based on Deep Adversarial Network in Interactive Design for Intelligent Robot

Contact Info

Product

Resources

About