Multi-task Learning for Human Affect Prediction with Auditory–Visual Synchronized Representation

Jeong, Euiseok; Oh, Geesung; Lim, Sejoon

doi:10.1109/cvprw56347.2022.00272

Cited by 3 publications

(2 citation statements)

References 22 publications

“…The Action Unit Detection Challenge of the 5th ABAW Competition [20] is based on the Aff-Wild2 [16-19, 21-24, 55] database. Some of the AU detection approaches in the previous ABAW Competitions [16,17,24] fuse multimodal features including video and audio to provide multidimensional information to predict AUs' occurrence [13,14,50,58]. Meanwhile, other studies found that AU detection performance can be benefited from multi-task learning [3,13,36,56], i.e., jointly conducting expression recognition or valence/arousal estimation provides helpful cues for AU detection.…”

Section: Introduction (mentioning)

confidence: 99%

“…Some of the AU detection approaches in the previous ABAW Competitions [16,17,24] fuse multimodal features including video and audio to provide multidimensional information to predict AUs' occurrence [13,14,50,58]. Meanwhile, other studies found that AU detection performance can be benefited from multi-task learning [3,13,36,56], i.e., jointly conducting expression recognition or valence/arousal estimation provides helpful cues for AU detection. Moreover, temporal models such as GRU [6] or Transformer [48] are also introduced to model temporal dynamics among consecutive frames [37,50].…”

Section: Introduction (mentioning)

confidence: 99%

Spatio-Temporal AU Relational Graph Representation Learning For Facial Action Units Detection

Wang, Song, Luo et al. 2023

Preprint

This paper presents our Facial Action Units (AUs) recognition submission to the fifth Affective Behavior Analysis in-the-wild Competition (ABAW). Our approach consists of three main modules: (i) a pre-trained facial representation encoder which produces a strong facial representation from each face image in the input sequence; (ii) an AU-specific feature generator that learns a set of AU features from each facial representation; and (iii) a spatio-temporal graph learning module that constructs a spatio-temporal graph representation. This graph representation describes the AUs contained in all frames and predicts the occurrence of each AU based on both the modeled spatial information within the corresponding face and the learned temporal dynamics among frames. The experimental results show that our approach outperformed the baseline and that spatio-temporal graph representation learning allows our model to generate the best results among all ablated systems. Our model ranked 4th in the AU recognition track at the 5th ABAW Competition.
