2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw56347.2022.00278

A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Abstract: In video-based emotion recognition, audio and visual modalities are often expected to have a complementary relationship, which is widely explored using cross-attention. However, they may also exhibit weak complementary relationships, resulting in poor representations of audio-visual features, thus degrading the performance of the system. To address this issue, we propose Dynamic Cross-Attention (DCA) that can dynamically select cross-attended or unattended features on the fly based on their strong or weak comp…
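As a rough illustration of the idea sketched in the abstract, below is a minimal PyTorch sketch of gated cross-attention fusion: each modality is cross-attended by the other, and a learned gate weighs the attended against the unattended features. The module names, gating form, and dimensions are assumptions for illustration, not the authors' exact DCA formulation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Illustrative sketch (not the authors' exact DCA formulation):
    cross-attend audio and visual features, then let a learned gate decide
    how much of the attended vs. unattended features to keep per time step."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        # hypothetical components; dimensions are assumptions
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate_a = nn.Sequential(nn.Linear(2 * d_model, 1), nn.Sigmoid())
        self.gate_v = nn.Sequential(nn.Linear(2 * d_model, 1), nn.Sigmoid())

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a: (B, L, d_model) audio features, x_v: (B, L, d_model) visual features
        att_a, _ = self.v2a(query=x_a, key=x_v, value=x_v)   # audio attended by video
        att_v, _ = self.a2v(query=x_v, key=x_a, value=x_a)   # video attended by audio
        # the gate chooses between cross-attended and unattended features
        g_a = self.gate_a(torch.cat([x_a, att_a], dim=-1))
        g_v = self.gate_v(torch.cat([x_v, att_v], dim=-1))
        out_a = g_a * att_a + (1.0 - g_a) * x_a
        out_v = g_v * att_v + (1.0 - g_v) * x_v
        return torch.cat([out_a, out_v], dim=-1)  # fused audio-visual representation
```

In this sketch the gate plays the role of the dynamic selection described in the abstract: when the two modalities complement each other poorly, the gate can fall back to the unattended features instead of the cross-attended ones.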

Cited by 48 publications (13 citation statements)
References 79 publications
“…Since multiple modalities contain different information, effectively capturing the complementary relationship between them is a pressing problem. To combine the modalities reliably, we introduce a cross-attention fusion mechanism [50] to effectively mine the information between different modalities while preserving the intra-modality features. Specifically, we first extract the deep features, denoted as $X_a \in \mathbb{R}^{d_a \times L}$ and $X_v \in \mathbb{R}^{d_v \times L}$, from the different modalities.…”
Section: The Proposed Methods
confidence: 99%
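The cross-attention fusion referred to above can be pictured with a small sketch over features $X_a \in \mathbb{R}^{d_a \times L}$ and $X_v \in \mathbb{R}^{d_v \times L}$: a cross-correlation matrix between the two streams yields per-step attention weights, and each modality keeps its own dimensionality, so intra-modal features are preserved. The weight parameterisation and the toy dimensions are assumptions for illustration, not necessarily the cited paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def cross_correlation_fusion(x_a: torch.Tensor, x_v: torch.Tensor, w: torch.Tensor):
    """Hedged sketch of cross-correlation-based audio-visual attention.

    x_a: (d_a, L) audio features, x_v: (d_v, L) visual features,
    w:   (d_a, d_v) learnable weights (an illustrative parameterisation)."""
    d_a = x_a.shape[0]
    # correlation between every audio and every visual time step: (L, L)
    corr = torch.tanh(x_a.t() @ w @ x_v / d_a ** 0.5)
    a2v = F.softmax(corr, dim=0)        # weights over audio steps, per visual step
    v2a = F.softmax(corr, dim=1)        # weights over visual steps, per audio step
    x_a_att = x_a @ a2v                 # (d_a, L): audio attended by the visual stream
    x_v_att = x_v @ v2a.t()             # (d_v, L): visual attended by the audio stream
    # each modality keeps its own dimensionality, preserving intra-modal features
    return x_a_att, x_v_att

# toy usage with assumed dimensions
x_a = torch.randn(128, 60)              # e.g. 128-d audio features over 60 time steps
x_v = torch.randn(512, 60)              # e.g. 512-d visual features over 60 time steps
w = torch.randn(128, 512, requires_grad=True)
fused_a, fused_v = cross_correlation_fusion(x_a, x_v, w)
```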
“…The self-attention is employed to model the intra-modal dependencies, while the cross-attention regards the representation of the other modality as the query vector to capture the inter-modal relationship. After that, we adopt a fusion method similar to [9] to further fuse the two modalities.…”
Section: Attention-based Cross-modal Feature Fusion
confidence: 99%
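A minimal sketch of the pattern described above: self-attention models the intra-modal dependencies, and cross-attention takes the other modality's representation as the query. The layer names, normalisation, and sizes are assumptions for illustration, not the cited paper's exact architecture.

```python
import torch
import torch.nn as nn

class IntraInterModalBlock(nn.Module):
    """Sketch: self-attention for intra-modal dependencies, then cross-attention
    in which the other modality's representation acts as the query."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x_self: torch.Tensor, x_other: torch.Tensor) -> torch.Tensor:
        # intra-modal: the modality attends to itself
        h, _ = self.self_attn(x_self, x_self, x_self)
        h = self.norm1(x_self + h)
        # inter-modal: the other modality's representation is the query
        c, _ = self.cross_attn(query=x_other, key=h, value=h)
        return self.norm2(x_other + c)
```

Because the other modality supplies the query, the output is aligned to that modality's time steps before the subsequent fusion step.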
“…Despite the above research progress, KWS in realistic environments with a low SNR is still challenging. Since signals from the audio modality are susceptible to contamination in complex environments, multi-modal methods have been widely explored in various tasks, including emotion recognition [9] and scene classification [10], to provide complementary information to the systems. These studies have shown that introducing the visual modality can improve the robustness of such systems.…”
Section: Introduction
confidence: 99%
“…Praveen et al. [35] improved the performance of the emotion recognition model by using a cross-attention module to fuse audio-visual features. Shriwardhana et al.…”
Section: Related Work
confidence: 99%