Interspeech 2020
DOI: 10.21437/interspeech.2020-1190

Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks

Cited by 45 publications (13 citation statements)
References 13 publications
“…Validation Accuracy, Speech-only: LSTM + Attention [24], GRU [25], Bi-LSTM [26], GRU + Attention [13], Bi-LSTM + Attention […”
Section: Model (citation type: mentioning)
confidence: 99%
“…Following prior works [13], we utilized a residual connection on the audio and textual representations to keep the original structure of the data (H_a^r, H_t^r). The result is then passed through a linear layer and a normalization layer.…”
Section: Cross-modal RoBERTa (citation type: mentioning)
confidence: 99%
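The statement above describes a residual connection over the audio and text representations followed by a linear and a normalization layer. Below is a minimal sketch of that step, assuming a PyTorch setting; the module name, dimensions, and usage are illustrative assumptions, not the cited authors' code.

```python
# Sketch (not the authors' code) of the residual-plus-projection step quoted above:
# cross-modal outputs are added back to the original audio/text representations,
# then passed through a linear layer and a normalization layer.
# Dimension sizes and variable names are assumptions for illustration.
import torch
import torch.nn as nn


class ResidualProjection(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # linear layer after the residual sum
        self.norm = nn.LayerNorm(dim)       # normalization layer

    def forward(self, h_cross: torch.Tensor, h_orig: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the original structure of the data.
        h_res = h_cross + h_orig
        return self.norm(self.linear(h_res))


# Hypothetical usage on audio and text representations.
dim = 256
proj_a, proj_t = ResidualProjection(dim), ResidualProjection(dim)
H_a, H_t = torch.randn(8, 50, dim), torch.randn(8, 40, dim)              # original features
H_a_cross, H_t_cross = torch.randn(8, 50, dim), torch.randn(8, 40, dim)  # cross-modal outputs
H_a_r = proj_a(H_a_cross, H_a)  # residual audio representation H_a^r
H_t_r = proj_t(H_t_cross, H_t)  # residual text representation H_t^r
```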
“…Although previous methods have achieved good performance [8,9], key challenges remain in multimodal emotion recognition: different modalities depend on independent preprocessing and feature-extraction designs because of their heterogeneous feature spaces; building a model that is applicable and generalizable both for the individual modalities and for the fusion model requires learning intra- and cross-modal interactions that reveal discriminative emotional content; and emotion is a subjective concept [10–12]. Moreover, textual information and its associated context convey more influential cues for inferring the speaker's emotions [13] and play a crucial role in inferring emotions in conversations. Unfortunately, existing approaches cannot efficiently capture the emotion-relevant textual content during fusion, and learning intra- and cross-modal information often loses some semantic content [12,14].…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Most previous studies focus on emotion recognition using only one modality, such as text, video, or voice. However, it has already been shown that emotion-recognition algorithms based on multiple modalities perform better than those that use only one modality (Povolny et al., 2016; Krishna and Patil, 2020). In our study, we go beyond this comparison: we measure the contribution of each encoded modality by intentionally perturbing the input at inference time to either exclude or corrupt some intelligible information, an approach that has already been proven effective for emotion-recognition classification, and by monitoring the resulting changes in the model's performance.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
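The last statement describes measuring each modality's contribution by perturbing its input at inference time and monitoring the change in performance. The following is a minimal sketch of that idea under assumed conventions; `model`, its `(audio, text)` call signature, and the zeroing/noise corruption scheme are hypothetical placeholders, not the cited authors' implementation.

```python
# Sketch (assumptions, not the cited authors' code) of modality-contribution
# measurement: exclude or corrupt one modality's input at inference time and
# record the resulting drop in accuracy relative to the unperturbed model.
import torch


@torch.no_grad()
def accuracy(model, audio, text, labels):
    preds = model(audio, text).argmax(dim=-1)
    return (preds == labels).float().mean().item()


@torch.no_grad()
def modality_contribution(model, audio, text, labels, perturb="zero"):
    """Drop in accuracy when the audio or text input is excluded/corrupted."""
    base = accuracy(model, audio, text, labels)
    if perturb == "zero":   # exclude the modality entirely
        audio_p, text_p = torch.zeros_like(audio), torch.zeros_like(text)
    else:                   # corrupt it with Gaussian noise
        audio_p = audio + torch.randn_like(audio)
        text_p = text + torch.randn_like(text)
    return {
        "audio": base - accuracy(model, audio_p, text, labels),
        "text": base - accuracy(model, audio, text_p, labels),
    }
```

A larger accuracy drop for a perturbed modality indicates that the model relies more heavily on that modality's information.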