Multi-View Speech Emotion Recognition Via Collective Relation Construction

Hou, Mi-Xiao; Zhang, Zheng; Cao, Qi Zhi; Zhang, David; Lu, Yao

doi:10.1109/taslp.2021.3133196

Cited by 29 publications

(8 citation statements)

References 48 publications

“…Initially, EMD algorithms were widely used in signal processing fields such as mechanical fault diagnosis. In recent years, EMD has been applied to the analysis and enhancement of acoustic features [18].…”

Section: A. The Acoustic Feature Mapping Module (AFMM) | mentioning

confidence: 99%

“…Similarly, Pan et al [17] proposed a strategy for SER by combining the Evolutional Algorithm (EA) with the Empirical Mode Decomposition (EMD) to improve the emotion recognition rate. Hou et al [18] investigated the function of multi-view speech spectrograms, which includes extracting multi-view features by the attention network and the collective relation network. The results of the above methods show that it is feasible to exploit the multi-view speech representations.…”

Section: A. Acoustic Features Extraction | mentioning

confidence: 99%

Video-Based Cross-Modal Auxiliary Network for Multimodal Sentiment Analysis

Chen, Zhou, et al. 2022. IEEE Trans. Circuits Syst. Video Technol.

Multimodal sentiment analysis has a wide range of applications due to its information complementarity in multimodal interactions. Previous works focus more on investigating efficient joint representations, but they rarely consider the insufficient unimodal features extraction and data redundancy of multimodal fusion. In this paper, a Video-based Cross-modal Auxiliary Network (VCAN) is proposed, which is comprised of an audio features map module and a cross-modal selection module. The first module is designed to substantially increase feature diversity in audio feature extraction, aiming to improve classification accuracy by providing more comprehensive acoustic representations. To empower the model to handle redundant visual features, the second module is addressed to efficiently filter the redundant visual frames during integrating audiovisual data. Moreover, a classifier group consisting of several image classification networks is introduced to predict sentiment polarities and emotion categories. Extensive experimental results on RAVDESS, CMU-MOSI, and CMU-MOSEI benchmarks indicate that VCAN is significantly superior to the state-of-the-art methods for improving the classification accuracy of multimodal sentiment analysis.

“…The dimension of features is 43. We use leave-one-speaker-out (LOSO) 10-fold cross-validation to provide an accurate assessment of the proposed IMEMD-CRNN model (Hou et al, 2022). In the LOSO 10-fold cross-validation method, utterances of 8 speakers are used as training set, one speaker is selected as the validation data, and utterances of the left-out speaker are used as the testing set.…”

Section: Results | mentioning

confidence: 99%
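
The speaker split described in the statement above can be sketched as follows. This is a minimal illustration, not the citing authors' code: the speaker IDs are placeholders, and the rule for picking the validation speaker (the next speaker in the list, wrapping around) is an assumption, since the quoted text does not say how that speaker is chosen.

```python
from typing import Iterator, List, Tuple


def loso_splits(speakers: List[str]) -> Iterator[Tuple[List[str], str, str]]:
    """Yield (train_speakers, validation_speaker, test_speaker) per fold.

    Each speaker is held out once as the test speaker. The validation
    speaker is chosen as the next speaker in the list (an assumption made
    for this sketch); all remaining speakers form the training set.
    """
    n = len(speakers)
    if n < 3:
        raise ValueError("need at least three speakers for train/val/test")
    for i, test_speaker in enumerate(speakers):
        val_speaker = speakers[(i + 1) % n]
        train = [s for s in speakers if s not in (test_speaker, val_speaker)]
        yield train, val_speaker, test_speaker


# Hypothetical IDs for a ten-speaker corpus; real corpora use their own labels.
speakers = [f"spk{k:02d}" for k in range(10)]
folds = list(loso_splits(speakers))

for train, val, test in folds:
    print(f"test={test} val={val} train={len(train)} speakers")
```

With ten speakers this gives ten folds, each with eight training speakers, one validation speaker, and one test speaker, which matches the proportions in the quoted description.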

“…The unweighted accuracy of our method reaches 93.54%, greater than the SOTA method by 1.03%. To verify that the improvement in accuracy of the proposed method is statistically significant compared to the SOTA method (the method proposed by Hou et al (2022)), a paired-sample t-test is used. The null hypothesis is that the pairwise difference between the UA of the two methods has a mean equal to zero.…”

Section: Results | mentioning

confidence: 99%
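
The paired-sample t-test mentioned in the statement above tests whether the mean of the per-fold differences is zero. The sketch below computes only the t statistic, using the standard library; the function name and the per-fold accuracy values are hypothetical placeholders, not results from either paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the paired-sample t statistic for two matched samples.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the pairwise differences
    and sd is the sample standard deviation (n - 1 in the denominator).
    """
    if len(a) != len(b):
        raise ValueError("samples must have the same length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-fold unweighted accuracies (illustrative values only).
ua_method_a = [0.91, 0.94, 0.92, 0.95, 0.93]
ua_method_b = [0.90, 0.92, 0.93, 0.93, 0.91]

t_value = paired_t_statistic(ua_method_a, ua_method_b)
print(f"t = {t_value:.3f} with {len(ua_method_a) - 1} degrees of freedom")
```

The statistic is then compared against a t distribution with n - 1 degrees of freedom to obtain a p-value; that step is left out here to keep the sketch dependency-free.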

“…The speech emotion recognition accuracy is improved when the dataset is small. Hou et al (2022) proposed a collective multi-view relation network (CMRN) based on bidirectional gate recurrent units (Bi-GRU) and the attention mechanism. In the CMRN, Mel-frequency cepstral coefficients (MFCCs), log Mel-frequency spectral coefficients (MFSCs), and prosody features are collected as multi-view representations.…”

Section: Introduction | mentioning

confidence: 99%

Speech emotion recognition based on improved masking EMD and convolutional recurrent neural network

Sun, 2023. Front. Psychol.

Speech emotion recognition (SER) is the key to human-computer emotion interaction. However, the nonlinear characteristics of speech emotion are variable, complex, and subtly changing. Therefore, accurate recognition of emotions from speech remains a challenge. Empirical mode decomposition (EMD), as an effective decomposition method for nonlinear non-stationary signals, has been successfully used to analyze emotional speech signals. However, the mode mixing problem of EMD affects the performance of EMD-based methods for SER. Various improved methods for EMD have been proposed to alleviate the mode mixing problem. These improved methods still suffer from the problems of mode mixing, residual noise, and long computation time, and their main parameters cannot be set adaptively. To overcome these problems, we propose a novel SER framework, named IMEMD-CRNN, based on the combination of an improved version of the masking signal-based EMD (IMEMD) and convolutional recurrent neural network (CRNN). First, IMEMD is proposed to decompose speech. IMEMD is a novel disturbance-assisted EMD method and can determine the parameters of masking signals to the nature of signals. Second, we extract the 43-dimensional time-frequency features that can characterize the emotion from the intrinsic mode functions (IMFs) obtained by IMEMD. Finally, we input these features into a CRNN network to recognize emotions. In the CRNN, 2D convolutional neural networks (CNN) layers are used to capture nonlinear local temporal and frequency information of the emotional speech. Bidirectional gated recurrent units (BiGRU) layers are used to learn the temporal context information further. Experiments on the publicly available TESS dataset and Emo-DB dataset demonstrate the effectiveness of our proposed IMEMD-CRNN framework. The TESS dataset consists of 2,800 utterances containing seven emotions recorded by two native English speakers. 
The Emo-DB dataset consists of 535 utterances containing seven emotions recorded by ten native German speakers. The proposed IMEMD-CRNN framework achieves a state-of-the-art overall accuracy of 100% for the TESS dataset over seven emotions and 93.54% for the Emo-DB dataset over seven emotions. The IMEMD alleviates the mode mixing and obtains IMFs with less noise and more physical meaning with significantly improved efficiency. Our IMEMD-CRNN framework significantly improves the performance of emotion recognition.
