2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.02025
FERV39k: A Large-Scale Multi-Scene Dataset for Facial Expression Recognition in Videos

Cited by 55 publications (14 citation statements); References 39 publications.
“…We use two in-the-wild DFER datasets (i.e., DFEW [14] and FERV39K [15]) to evaluate our proposed method. For both DFEW and FERV39K, the processed face region images are officially detected, aligned and publicly available.…”
Section: Methods
confidence: 99%
“…MAE is a self-supervised model trained on label-free data by masking random patches of the input image and reconstructing the missing patches in pixel space. We used a face dataset of 1.2 million images, including DFEW [24], EmotioNet [25], FERV39k [26], and others, to pre-train the MAE encoder. We denote this feature as mae; its dimension is 768.…”
Section: Visual Features
confidence: 99%
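The MAE-style pre-training described in the excerpt above (masking random patches and reconstructing them in pixel space) can be sketched at the masking step as follows. The helper `random_patch_mask`, the 16-pixel patch size, and the 75% mask ratio are illustrative assumptions following common MAE defaults, not details taken from the cited paper:

```python
import numpy as np

def random_patch_mask(image, patch=16, mask_ratio=0.75, seed=0):
    """Split an image into non-overlapping patches and zero out a random
    subset, as in MAE-style masking (hypothetical helper; patch size and
    75% ratio are assumed defaults)."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    n = gh * gw                       # total number of patches
    rng = np.random.default_rng(seed)
    masked_idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    masked = image.copy()
    for idx in masked_idx:
        r, col = divmod(idx, gw)
        masked[r*patch:(r+1)*patch, col*patch:(col+1)*patch, :] = 0.0
    return masked, masked_idx

img = np.ones((224, 224, 3), dtype=np.float32)
masked, idx = random_patch_mask(img)
print(len(idx))  # 147 of the 196 patches are masked (75%)
```

The encoder would then see only the visible patches, and a lightweight decoder would be trained to reconstruct the masked ones in pixel space.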
“…Researchers have proposed various techniques to effectively improve the performance of DFER methods in laboratory scenarios (Yu et al 2018; Jeong, Kim, and Dong 2020). Compared with lab-controlled DFER datasets, the in-the-wild ones are closer to natural facial events and can provide more diverse data by collecting video sequences from the internet, such as Aff-Wild (Zafeiriou et al 2017), DFEW (Jiang et al 2020), and FERV39k (Wang et al 2022). As shown in Figure 1, video sequences in the real world with different expression intensities can cause the inter-class distance to become smaller than the intra-class distance.…”
Section: Introduction
confidence: 99%
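The inter-class versus intra-class distance problem raised in the excerpt above can be illustrated with a toy example. All feature vectors here are hypothetical, chosen only to show how differing expression intensities can push same-class samples further apart than samples from different classes:

```python
import numpy as np

# Hypothetical 2-D features for two expression classes.
happy_strong = np.array([1.0, 0.0])   # high-intensity "happy"
happy_weak   = np.array([0.1, 0.0])   # low-intensity "happy" (same class)
sad_weak     = np.array([0.0, 0.2])   # low-intensity "sad" (other class)

intra = np.linalg.norm(happy_strong - happy_weak)  # within-class distance
inter = np.linalg.norm(happy_weak - sad_weak)      # between-class distance
print(intra > inter)  # True: intra-class distance exceeds inter-class
```

When intensity varies this much, a nearest-neighbor decision on raw features would misgroup the weak samples, which is the failure mode the cited work targets.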
“…For DFER in the wild, early works were mainly designed around hand-crafted features, like LBP-TOP (Dhall et al 2013), STLMBP (Huang et al 2014), and HOG-TOP (Chen et al 2014). In recent years, with the development of parallel computing hardware and the collection of large-scale DFER datasets (Wang et al 2022; Jiang et al 2020), deep learning-based methods have gradually replaced algorithms based on hand-crafted features and achieved state-of-the-art performance on in-the-wild DFER datasets. For instance, the vision transformer (ViT) (Dosovitskiy et al 2020) has obtained promising results on many computer vision tasks, which has inspired many researchers to build DFER models based on ViT (Ma, Sun, and Li 2022).…”
Section: Introduction
confidence: 99%