Valence and Arousal Estimation based on Multimodal Temporal-Aware Features for Videos in the Wild

Meng, Liyu; Liu, Yuchen; Liu, Xiaolong; Huang, Zhaopei; Jiang, Wenqiang; Zhang, Tenggan; Liu, Chuanhe; Jin, Qin

doi:10.1109/cvprw56347.2022.00261

Cited by 16 publications

(19 citation statements)

References 27 publications


“…Meng et al. use Transformer [29] and LSTM [30] encoders to capture temporal context information in the video to complete continuous emotion prediction [31]. This method won the 2022 ABAW challenge, VA track [32, 33].…”

Section: Related Work (mentioning)

confidence: 99%

Improvement of continuous emotion recognition of temporal convolutional networks with incomplete labels

Wang, Zheng, Liu. 2023. IET Image Processing.


Video‐based emotion recognition has been a long‐standing research topic for computer scientists and psychiatrists. In contrast to traditional discrete emotional models, emotion recognition based on continuous emotional models can better describe the progression of emotions. Quantitative analysis of emotions will have crucial impacts on promoting the development of intelligent products. The current solutions to continuous emotion recognition still have many issues. The original continuous emotion dataset contains incomplete data annotations, and the existing methods often ignore temporal information between frames. The following measures are taken in response to the above problems. Initially, aiming at the problem of incomplete video labels, the correlation between discrete and continuous video emotion labels is used to complete the dataset labels. This correlation is used to propose a mathematical model to fill the missing labels of the original dataset without adding data. Moreover, this paper proposes a continuous emotion recognition network based on an optimized temporal convolutional network, which adds a feature extraction submodule and a residual module to retain shallow features while improving the feature extraction ability. Finally, validation experiments on the Aff‐wild2 dataset achieved accuracies of 0.5159 and 0.65611 on the valence and arousal dimensions, respectively, by adopting the above measures.


“…Attention models for ER: Recently, multimodal transformers with CA showed significant improvement for ER [6,7,8]. Parthasarathy et al [9] explored multimodal transformers, where the CA module is integrated with the self-attention module to obtain the A-V cross-modal feature representations.…”

Section: Related Work (mentioning)

confidence: 99%

“…Zhang et al [10] proposed a leader-follower attention mechanism by considering the visual modality as the primary channel, while the audio modality is used as a supplementary channel to boost visual performance. Karas et al [6] and Meng et al [8] showed improvement in fusion performance by exploring a set of fusion models based on LSTMs and transformers. Zhou et al [7] explored temporal convolutional networks (TCNs) for individual modalities, whereas Zhang et al [11] exploited masked auto-encoders for visual modality.…”

Section: Related Work (mentioning)

confidence: 99%

“…Zhou et al [7] explored temporal convolutional networks (TCNs) for individual modalities, whereas Zhang et al [11] exploited masked auto-encoders for visual modality. However, most of these methods [7,11,6,8] rely on a naive fusion approach or ensemble-based fusion using transformers and LSTMs. Unlike these approaches, Praveen et al [2] proposed a CA model to effectively leverage complementary relationships by allowing the modalities to interact with each other.…”

Section: Related Work (mentioning)

confidence: 99%

“…Zhang et al [10] exploited audio as a supplementary channel to boost the performance of visual modality. Meng et al [8] showed significant improvement in both development and test sets by leveraging three external datasets and multiple backbones for audio and visual modalities using an ensemble of LSTMs and transformers. Similarly, Zhou et al [7] and Zhang et al [11] also explored multiple backbones to achieve better generalization and improved performance on the test set.…”

Section: C) Dynamic Cross-attention Model (mentioning)

confidence: 99%


A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Rajasekar, Melo, Ullah et al. 2022. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).


In video-based emotion recognition, audio and visual modalities are often expected to have a complementary relationship, which is widely explored using cross-attention. However, they may also exhibit weak complementary relationships, resulting in poor representations of audio-visual features, thus degrading the performance of the system. To address this issue, we propose Dynamic Cross-Attention (DCA) that can dynamically select cross-attended or unattended features on the fly based on their strong or weak complementary relationship with each other, respectively. Specifically, a simple yet efficient gating layer is designed to evaluate the contribution of the cross-attention mechanism and choose cross-attended features only when they exhibit a strong complementary relationship, otherwise unattended features. We evaluate the performance of the proposed approach on the challenging RECOLA and Aff-Wild2 datasets. We also compare the proposed approach with other variants of cross-attention and show that the proposed model consistently improves the performance on both datasets.
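The gating idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-way softmax gate, the `temperature` parameter, and the choice to score the gate from the cross-attended features are all assumptions made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gate_features(unattended, cross_attended, gate_weights, temperature=0.1):
    """Blend unattended and cross-attended feature vectors with a soft gate.

    gate_weights is a hypothetical pair of weight vectors (assumed to be
    learned elsewhere): row 0 scores the unattended path, row 1 scores the
    cross-attended path. A low temperature makes the gate close to a hard
    selection between the two paths.
    """
    # One scalar score per path, computed here from the cross-attended features.
    scores = [
        sum(w * x for w, x in zip(row, cross_attended)) for row in gate_weights
    ]
    p_unattended, p_cross = softmax([s / temperature for s in scores])
    # Weighted combination of the two candidate representations.
    blended = [
        p_unattended * u + p_cross * c
        for u, c in zip(unattended, cross_attended)
    ]
    return blended, (p_unattended, p_cross)


if __name__ == "__main__":
    unattended = [1.0, 0.0]
    cross_attended = [0.0, 1.0]
    # Weights that favour the cross-attended path for this input.
    blended, probs = gate_features(
        unattended, cross_attended, [[0.0, 0.0], [0.0, 5.0]]
    )
    print(blended, probs)
```

With a small temperature the gate behaves almost like a switch, which matches the abstract's description of choosing one set of features or the other; a larger temperature would mix the two more evenly.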


Multi-Task Learning Framework for Emotion Recognition In-the-Wild

Zhang, Liu, Liu et al. 2023. Lecture Notes in Computer Science.

