Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Liu, Yang; Sun, Haoqin; Guan, Wenbo; Xia, Yuqi; Zhen, Zhao

doi:10.1016/j.specom.2022.02.006

Cited by 32 publications

(4 citation statements)

References 16 publications


“…We use the GitHub implementation 3 of the state-of-the-art architecture with our experimental framework and CQT-MSF feature for performance comparison. • We also select the state-of-the-art study performed by Liu et al (2022) for multimodal emotion recognition in our work. As our primary focus is emotion recognition from speech, we use only the bidirectional-contextualised LSTM (bc-LSTM) with multi-head attention block used for speech modality in [90] with our experimental framework and databases.…”

Section: Comparison With Related Work (mentioning)

confidence: 99%

“…• We also select the state-of-the-art study performed by Liu et al (2022) for multimodal emotion recognition in our work. As our primary focus is emotion recognition from speech, we use only the bidirectional-contextualised LSTM (bc-LSTM) with multi-head attention block used for speech modality in [90] with our experimental framework and databases. As emotion information spreads temporally across utterances, temporal pattern extraction architectures like LSTM, attention, are selected to compare with handcrafted temporal modulation feature, i.e., CQT-MSF.…”

Section: Comparison With Related Work (mentioning)

confidence: 99%


Modulation spectral features for speech emotion recognition using deep neural networks

Singh

Sahidullah

Saha

2023

Speech Communication


“…In addition, though all modalities consist of emotional cues pertinent to emotion recognition, the uttered sentences in the text modality consist of grammatical and semantic features [8] which can provide supplementary knowledge to the model for effective and robust SER. In [37], a model was suggested that uses multi-channel convolutional neural networks (MCNNs) to learn emotional and grammatical features from the text. However, as asserted in [8], CNNs and RNNs can weakly extract grammatical and semantic information since they are good at learning spatial and temporal features but not the context of the sequences.…”

Section: Related Work (mentioning)

confidence: 99%

Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features

2022


The detection and classification of emotional states in speech involves the analysis of audio signals and text transcriptions. There are complex relationships between the extracted features at different time intervals which ought to be analyzed to infer the emotions in speech. These relationships can be represented as spatial, temporal and semantic tendency features. In addition to emotional features that exist in each modality, the text modality consists of semantic and grammatical tendencies in the uttered sentences. Spatial and temporal features have been extracted sequentially in deep learning-based models using convolutional neural networks (CNN) followed by recurrent neural networks (RNN) which may not only be weak at the detection of the separate spatial-temporal feature representations but also the semantic tendencies in speech. In this paper, we propose a deep learning-based model named concurrent spatial-temporal and grammatical (CoSTGA) model that concurrently learns spatial, temporal and semantic representations in the local feature learning block (LFLB) which are fused as a latent vector to form an input to the global feature learning block (GFLB). We also investigate the performance of multi-level feature fusion compared to single-level fusion using the multi-level transformer encoder model (MLTED) that we also propose in this paper. The proposed CoSTGA model uses multi-level fusion first at the LFLB level where similar features (spatial or temporal) are separately extracted from a modality and secondly at the GFLB level where the spatial-temporal features are fused with the semantic tendency features. The proposed CoSTGA model uses a combination of dilated causal convolutions (DCC), bidirectional long short-term memory (BiLSTM), transformer encoders (TE), multi-head and self-attention mechanisms. Acoustic and lexical features were extracted from the interactive emotional dyadic motion capture (IEMOCAP) dataset. 
The proposed model achieves weighted and unweighted accuracies of 75.50% and 75.82%, and a recall and F1 score of 75.32% and 75.57%, respectively. These results imply that concurrently learned spatial-temporal features, combined with semantic tendencies learned in a multi-level approach, improve the model's effectiveness and robustness.
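
The abstract above reports both weighted and unweighted accuracy. As a minimal sketch, assuming the convention common in speech emotion recognition work (weighted accuracy is overall accuracy across all utterances; unweighted accuracy is the mean of per-class recalls), the two can be computed as follows. The function names and the toy labels are hypothetical and are not taken from the cited paper or from IEMOCAP.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correctly predicted utterances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical, deliberately imbalanced toy labels.
y_true = ["neutral", "neutral", "neutral", "happy"]
y_pred = ["neutral", "neutral", "neutral", "sad"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

On imbalanced data the two can diverge: here the majority class is always right and the minority class is always wrong, so the weighted figure is noticeably higher than the unweighted one.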


“…Considering the aforementioned problem, the attention mechanism (AM) has been introduced into deep learning models. AM can assist the network in focusing more on the informative features in the inputs by ignoring the features that contribute less to the final classification, and it has been widely used in various applications, such as image segmentation [25,26], document classification [27,28], natural language processing [29,30], and intelligent fault diagnosis [31,32]. When using AM in rolling bearing fault diagnosis models, AM can be beneficial for locating informative fault features with respect to different conditions of bearing health, improving the ability of feature capturing and avoiding negative effects caused by equal treatment of all inputs.…”

Section: Introduction (mentioning)

confidence: 99%

A dual attention mechanism network with self-attention and frequency channel attention for intelligent diagnosis of multiple rolling bearing fault types

Zhang

Yang

et al.

2023

Meas. Sci. Technol.


Different fault types of rolling bearings correspond to different features, and classical deep learning models using a single attention mechanism (AM) have limitations in capturing feature diversity. Therefore, a novel dual attention mechanism network (DAMN) with self-attention (SA) and frequency channel attention (FCA) is proposed for rolling bearing fault diagnosis. The SA mechanism is used to capture global relationships between the input features and fault types, and the FCA mechanism applies multi-spectral attention to learn the local useful information among different input channels. The results of the ablation study on the effects of FCA blocks showed that including a proper combination of multiple frequency components is helpful in achieving higher accuracy. Experiments were conducted to diagnose rolling bearings with multiple types of faults. The results show that, compared with current fault diagnosis models, the proposed DAMN has better comprehensive performance in terms of diagnosis accuracy and model convergence speed. It was also demonstrated that the backbone of DAMN based on a dual AM could achieve better performance than the backbone based on a single AM.

