Coarse-to-Fine Cascaded Networks with Smooth Predicting for Video Facial Expression Recognition

Xue, Fanglei; Tan, Zichang; Zhu, Yu; Ma, Zhong‐Qi; Guo, Guodong

doi:10.1109/cvprw56347.2022.00269

Cited by 34 publications

(5 citation statements)

References 28 publications


“…Since the human face contains strong salient information that is conducive to extracting more refined emotion information, such as micro-expressions [22][23][24], the research on human emotion recognition methods throughout the past decade has focused on facial expression analysis [25][26][27][28][29]. Traditional research either uses facial fiducial points based on the Gabor-feature facial point detector [30] or focuses on facial action unit detection where a set of facial muscle movements is utilized for encoding corresponding facial expressions [31,32].…”

Section: Context-aware Emotion Recognition (mentioning)

confidence: 99%

Emotion Recognition from Large-Scale Video Clips with Cross-Attention and Hybrid Feature Weighting Neural Networks

Zhou, Jiang, et al. 2023

IJERPH


The emotion of humans is an important indicator or reflection of their mental states, e.g., satisfaction or stress, and recognizing or detecting emotion from different media is essential to perform sequence analysis or for certain applications, e.g., mental health assessments, job stress level estimation, and tourist satisfaction assessments. Emotion recognition based on computer vision techniques, as an important method of detecting emotion from visual media (e.g., images or videos) of human behaviors with the use of plentiful emotional cues, has been extensively investigated because of its significant applications. However, most existing models neglect inter-feature interaction and use simple concatenation for feature fusion, failing to capture the crucial complementary gains between face and context information in video clips, which is significant in addressing the problems of emotion confusion and emotion misunderstanding. Accordingly, in this paper, to fully exploit the complementary information between face and context features, we present a novel cross-attention and hybrid feature weighting network to achieve accurate emotion recognition from large-scale video clips, and the proposed model consists of a dual-branch encoding (DBE) network, a hierarchical-attention encoding (HAE) network, and a deep fusion (DF) block. Specifically, the face and context encoding blocks in the DBE network generate the respective shallow features. After this, the HAE network uses the cross-attention (CA) block to investigate and capture the complementarity between facial expression features and their contexts via a cross-channel attention operation. The element recalibration (ER) block is introduced to revise the feature map of each channel by embedding global information. Moreover, the adaptive-attention (AA) block in the HAE network is developed to infer the optimal feature fusion weights and obtain the adaptive emotion features via a hybrid feature weighting operation. Finally, the DF block integrates these adaptive emotion features to predict an individual emotional state. Extensive experimental results of the CAER-S dataset demonstrate the effectiveness of our method, exhibiting its potential in the analysis of tourist reviews with video clips, estimation of job stress levels with visual emotional evidence, or assessments of mental healthiness with visual media.



“…Jeong et al [11] extended the DAN model and achieved 2nd in ABAW3. Xue et al [30] utilized a coarse-to-fine cascade network with a temporal smoothing strategy and ranked 3rd in ABAW3. Zhang et al [35] found that AU, VA, and Expr representations are intrinsically associated with each other and proposed a streaming network for multi-task learning.…”

Section: Related Work (mentioning)

confidence: 99%

Exploring Expression-related Self-supervised Learning for Affective Behaviour Analysis

Xue, Yang 2023

Preprint


This paper explores an expression-related self-supervised learning (SSL) method (ContraWarping) to perform expression classification in the 5th Affective Behavior Analysis in-the-wild (ABAW) competition. Affective datasets are expensive to annotate, and SSL methods could learn from large-scale unlabeled data, which is more suitable for this task. By evaluating on the Aff-Wild2 dataset, we demonstrate that ContraWarping outperforms most existing supervised methods and shows great application potential in the affective analysis area. Codes will be released on: https://github.com/youqingxiaozhua/ABAW5.
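The citation statement for this entry attributes a "temporal smoothing strategy" to Xue et al. without describing it. As a rough illustration of the general idea only, the sketch below averages per-frame class probabilities over a centered sliding window before taking the argmax. The window size, the function names `smooth_predictions` and `argmax_labels`, and the toy probabilities are all assumptions made for this example; they are not taken from the cited paper.

```python
def smooth_predictions(frame_probs, window=5):
    """Average each frame's class probabilities over a centered window.

    Assumed rule for illustration; windows are truncated at clip boundaries.
    """
    n = len(frame_probs)
    half = window // 2
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        chunk = frame_probs[lo:hi]
        num_classes = len(frame_probs[i])
        smoothed.append(
            [sum(p[c] for p in chunk) / len(chunk) for c in range(num_classes)]
        )
    return smoothed


def argmax_labels(probs):
    """Return the index of the most probable class for every frame."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


# Toy clip: five frames, two classes, one isolated flip at frame 2.
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.8, 0.2]]
raw_labels = argmax_labels(toy_probs)
smoothed_labels = argmax_labels(smooth_predictions(toy_probs, window=3))
print(raw_labels, smoothed_labels)
```

In this toy clip the isolated flip at frame 2 is averaged away, which is the kind of frame-level jitter a smoothing post-process is meant to suppress.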


“…Jeong et al [15] proposed a multi-head cross attention networks and pretrained on Glint360K [1] and some private commercial datasets. Xue et al [46] proposed the Coarse-to-Fine Cascaded networks (CFC) to address the label ambiguity problem and used smooth predicting method to post-process the extracted features. Savchenko et al [37] proposed the novel frame-level emotion recognition algorithm which can be implemented even for video analytics on mobile devices.…”

Section: Affective Behavior Analysis In-the-wild (mentioning)

confidence: 99%

Exploring Large-scale Unlabeled Faces to Enhance Facial Expression Recognition

Wang, Cai, Li, et al. 2023

Preprint


Facial Expression Recognition (FER) is an important task in computer vision and has wide applications in human-computer interaction, intelligent security, emotion analysis, and other fields. However, the limited size of FER datasets limits the generalization ability of expression recognition models, resulting in ineffective model performance. To address this problem, we propose a semi-supervised learning framework that utilizes unlabeled face data to train expression recognition models effectively. Our method uses a dynamic threshold module (DTM) that can adaptively adjust the confidence threshold to fully utilize the face recognition (FR) data to generate pseudo-labels, thus improving the model's ability to model facial expressions. In the ABAW5 EXPR task, our method achieved excellent results on the official validation set.
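The abstract above says the dynamic threshold module adapts the confidence threshold used to generate pseudo-labels, but it does not say how. The sketch below shows one possible reading: confidence-thresholded pseudo-labelling in which the threshold follows an exponential moving average of the batch's mean top-class confidence. The update rule, the `momentum` parameter, and the function names `update_threshold` and `pseudo_label` are assumptions for illustration and should not be read as the authors' DTM.

```python
def update_threshold(threshold, confidences, momentum=0.9):
    """Move the threshold toward the batch's mean top-class confidence.

    Assumed update rule (exponential moving average), not the paper's.
    """
    batch_mean = sum(confidences) / len(confidences)
    return momentum * threshold + (1 - momentum) * batch_mean


def pseudo_label(batch_probs, threshold):
    """Keep (sample index, label) pairs whose top probability meets the threshold."""
    selected = []
    for i, probs in enumerate(batch_probs):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((i, label))
    return selected


# Toy batch of unlabeled faces with two-class model outputs.
toy_batch = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
toy_threshold = update_threshold(0.5, [max(p) for p in toy_batch], momentum=0.5)
print(toy_threshold, pseudo_label(toy_batch, toy_threshold))
```

Under this assumed rule the threshold rises as the model grows more confident, so early batches admit more pseudo-labels and later batches admit only the cleaner ones.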
