2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021)
DOI: 10.1109/fg52635.2021.9667055

Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition

Abstract: Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities, such as audio, visual, and biosignals. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted f…
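
To make the fusion idea concrete, here is a minimal PyTorch sketch of cross-attentional A-V fusion, in which each modality attends to the other and the attended features are concatenated for a downstream valence/arousal regressor. This is an illustration only, not the authors' implementation: the module name `CrossAttentionFusion`, the feature dimensions, and the use of scaled dot-product attention followed by concatenation are all assumptions.

```python
# Minimal sketch of cross-attentional audio-visual fusion (illustrative only;
# dimensions, projections, and the final fusion step are assumptions, not the
# authors' exact architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim_a: int, dim_v: int, dim: int = 128):
        super().__init__()
        # Project each modality into a shared attention space.
        self.proj_a = nn.Linear(dim_a, dim)
        self.proj_v = nn.Linear(dim_v, dim)
        self.scale = dim ** -0.5

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a: (batch, T, dim_a) audio features; x_v: (batch, T, dim_v) visual features.
        a = self.proj_a(x_a)
        v = self.proj_v(x_v)
        # Each modality attends to the other: audio queries visual keys and vice versa.
        attn_av = F.softmax(a @ v.transpose(1, 2) * self.scale, dim=-1)  # (batch, T, T)
        attn_va = F.softmax(v @ a.transpose(1, 2) * self.scale, dim=-1)
        a_att = attn_av @ v  # audio features enriched with visual context
        v_att = attn_va @ a  # visual features enriched with audio context
        # Simple fusion by concatenation; a downstream regressor would predict
        # valence/arousal for dimensional emotion recognition.
        return torch.cat([a_att, v_att], dim=-1)

fusion = CrossAttentionFusion(dim_a=64, dim_v=512)
fused = fusion(torch.randn(2, 16, 64), torch.randn(2, 16, 512))
print(fused.shape)  # torch.Size([2, 16, 256])
```
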

Cited by 27 publications (25 citation statements: 0 supporting, 25 mentioning, 0 contrasting)
References: 61 publications
“…Since multimodal analysis can leverage both independent and complementary information to provide comprehensive representations [25], it has drawn much interest in sentiment analysis [26]. Multimodal fusion strategies can be grouped into feature-level fusion [27] and decision-level fusion, with the latter being the current mainstream approach [28].…”
Section: Audiovisual Information Fusion (mentioning)
confidence: 99%
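
The feature-level versus decision-level distinction drawn in the excerpt above can be illustrated with a toy sketch (all module names and dimensions here are hypothetical stand-ins, not taken from the cited works):

```python
# Toy contrast between feature-level and decision-level fusion
# (illustrative only; encoders, heads, and dimensions are hypothetical).
import torch
import torch.nn as nn

audio_enc = nn.Linear(40, 32)    # stand-in audio encoder
visual_enc = nn.Linear(512, 32)  # stand-in visual encoder

x_a, x_v = torch.randn(8, 40), torch.randn(8, 512)

# Feature-level fusion: combine representations, then predict once.
feature_head = nn.Linear(64, 2)  # e.g., valence and arousal
pred_feature = feature_head(torch.cat([audio_enc(x_a), visual_enc(x_v)], dim=-1))

# Decision-level fusion: predict per modality, then combine the decisions.
audio_head, visual_head = nn.Linear(32, 2), nn.Linear(32, 2)
pred_decision = 0.5 * (audio_head(audio_enc(x_a)) + visual_head(visual_enc(x_v)))

print(pred_feature.shape, pred_decision.shape)  # torch.Size([8, 2]) twice
```
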
“…The video-based audiovisual fusion focuses more on capturing the complex spatiotemporal and semantic relationships among consecutive video frames. Under this condition, recurrent networks and their variants, together with the attention mechanism [32], are widely used in multimodal fusion [26]. For example, Ou et al. [4] extended the attention mechanism to obtain an effective global representation of the video.…”
Section: Audiovisual Information Fusion (mentioning)
confidence: 99%
“…1. More specifically, we have considered four baseline models: Cross-Attention (CA) [2], Joint Cross-Attention (JCA) [12], Recursive Joint Cross-Attention (RJCA) [13], and Transformer-based Cross-Attention (TCA) [9]. First, we have implemented a simple baseline of the CA model to capture the complementary relationship between audio and visual modalities.…”
Section: C) Dynamic Cross-attention Model (mentioning)
confidence: 99%
“…However, these methods fail to effectively capture the complementary intermodal relationships. Unlike these approaches, Praveen et al. [2, 12, 13] explored CA to capture the complementary relationship across the audio and visual modalities and showed improvement over prior methods. Although [2, 12, 13] achieved superior performance, they failed to deal with weak complementary relationships.…”
Section: C) Dynamic Cross-attention Model (mentioning)
confidence: 99%
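
As a rough illustration of how the JCA variant mentioned above differs from plain CA, one can let each modality attend to a joint audio-visual representation rather than to the other modality alone. The sketch below conveys only the general idea; the projection, dimensions, and any normalization or fusion details of the cited JCA model [12] are not reproduced here.

```python
# Sketch of the joint cross-attention idea: each modality attends to a joint
# audio-visual representation instead of the other modality alone.
# (Illustrative only; projections and dimensions are assumptions.)
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCrossAttention(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.proj_j = nn.Linear(2 * dim, dim)  # project the joint representation
        self.scale = dim ** -0.5

    def forward(self, a: torch.Tensor, v: torch.Tensor):
        # a, v: (batch, T, dim) audio and visual features in a shared space.
        j = self.proj_j(torch.cat([a, v], dim=-1))  # joint A-V representation
        # Each modality queries the joint representation.
        a_att = F.softmax(a @ j.transpose(1, 2) * self.scale, dim=-1) @ j
        v_att = F.softmax(v @ j.transpose(1, 2) * self.scale, dim=-1) @ j
        return a_att, v_att

jca = JointCrossAttention()
a_att, v_att = jca(torch.randn(2, 16, 128), torch.randn(2, 16, 128))
print(a_att.shape, v_att.shape)  # torch.Size([2, 16, 128]) twice
```
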