2022
DOI: 10.1016/j.specom.2022.02.006
Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by 32 publications (4 citation statements)
References 16 publications
“…We use the GitHub implementation 3 of the state-of-the-art architecture with our experimental framework and CQT-MSF feature for performance comparison. • We also select the state-of-the-art study performed by Liu et al. (2022) for multimodal emotion recognition in our work. As our primary focus is emotion recognition from speech, we use only the bidirectional-contextualised LSTM (bc-LSTM) with the multi-head attention block used for the speech modality in [90] with our experimental framework and databases.…”
Section: Comparison With Related Work
confidence: 99%
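The bc-LSTM with multi-head attention pipeline mentioned in this statement is not reproduced here; the PyTorch sketch below only illustrates that general idea, with assumed hyperparameters (feature dimension, hidden size, head count, and class count are placeholders, not values from [90] or Liu et al. (2022)).

```python
import torch
import torch.nn as nn

class BcLSTMAttention(nn.Module):
    """Sketch: bidirectional LSTM over an utterance's frame-level features,
    followed by multi-head self-attention and mean pooling for emotion
    classification. All hyperparameters are illustrative placeholders."""

    def __init__(self, feat_dim=80, hidden=128, heads=4, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=heads,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)                # (batch, time, 2 * hidden)
        a, _ = self.attn(h, h, h)          # self-attention across time steps
        return self.classifier(a.mean(dim=1))   # pool over time -> (batch, n_classes)

logits = BcLSTMAttention()(torch.randn(2, 300, 80))   # two utterances, 300 frames each
```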
“…• We also select the state-of-the-art study performed by Liu et al. (2022) for multimodal emotion recognition in our work. As our primary focus is emotion recognition from speech, we use only the bidirectional-contextualised LSTM (bc-LSTM) with the multi-head attention block used for the speech modality in [90] with our experimental framework and databases. As emotion information spreads temporally across utterances, temporal pattern extraction architectures such as LSTM and attention are selected to compare with the handcrafted temporal modulation feature, i.e., CQT-MSF.…”
Section: Comparison With Related Work
confidence: 99%
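The CQT-MSF feature named above is specific to the citing work and its full recipe is not given in this excerpt; as a hedged illustration of the constant-Q front end only, the sketch below computes a log-magnitude CQT with librosa (sample rate, hop length, and bin settings are assumptions, not the cited configuration).

```python
import numpy as np
import librosa

def log_cqt(path, sr=16000, hop_length=256, n_bins=84, bins_per_octave=12):
    """Return a log-magnitude constant-Q spectrogram for one utterance.
    All parameter values are illustrative defaults, not the cited recipe."""
    y, sr = librosa.load(path, sr=sr)
    cqt = librosa.cqt(y, sr=sr, hop_length=hop_length,
                      n_bins=n_bins, bins_per_octave=bins_per_octave)
    return librosa.amplitude_to_db(np.abs(cqt))   # shape: (n_bins, n_frames)
```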
“…In addition, though all modalities contain emotional cues pertinent to emotion recognition, the uttered sentences in the text modality also carry grammatical and semantic features [8], which can provide supplementary knowledge to the model for effective and robust SER. In [37], a model was proposed that uses multi-channel convolutional neural networks (MCNNs) to learn emotional and grammatical features from the text. However, as asserted in [8], CNNs and RNNs can only weakly extract grammatical and semantic information, since they are good at learning spatial and temporal features but not the context of the sequences.…”
Section: Related Work
confidence: 99%
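The MCNN of [37] is only named in this excerpt; the sketch below shows the generic multi-channel idea, parallel 1-D convolutions with different kernel widths over word embeddings (a TextCNN-style layout with assumed vocabulary size, embedding dimension, and kernel widths, not the configuration of [37]).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiChannelTextCNN(nn.Module):
    """Sketch: parallel convolution channels with different kernel widths over
    word embeddings, max-pooled and concatenated for emotion classification."""

    def __init__(self, vocab_size=10000, emb_dim=100, n_filters=64,
                 kernel_sizes=(2, 3, 4), n_classes=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)         # (batch, emb_dim, seq_len)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))     # (batch, n_classes)

logits = MultiChannelTextCNN()(torch.randint(0, 10000, (2, 30)))
```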
“…Considering the aforementioned problem, the attention mechanism (AM) has been introduced into deep learning models. AM can help the network focus on the informative features in the inputs by ignoring the features that contribute less to the final classification, and it has been widely used in various applications, such as image segmentation [25,26], document classification [27,28], natural language processing [29,30], and intelligent fault diagnosis [31,32]. When AM is used in rolling-bearing fault diagnosis models, it helps locate informative fault features across different bearing health conditions, improving feature capturing and avoiding the negative effects of treating all inputs equally.…”
Section: Introduction
confidence: 99%
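As a hedged illustration of the weighting behaviour this excerpt describes (not a specific model from the cited fault-diagnosis papers), the sketch below scores each time step of a feature sequence, softmax-normalises the scores, and pools the sequence so that less informative steps contribute little to the final representation.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch: score each time step, softmax the scores, and pool the sequence
    as a weighted sum, so uninformative steps receive near-zero weight."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):                          # x: (batch, time, feat_dim)
        w = torch.softmax(self.score(x), dim=1)    # attention weights: (batch, time, 1)
        return (w * x).sum(dim=1), w               # pooled: (batch, feat_dim)

pooled, weights = AttentionPooling()(torch.randn(2, 50, 128))
```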