Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild

Pini, Stefano; Ahmed, Olfa Ben; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita; Huet, Benoît

doi:10.1145/3136755.3143006

Cited by 38 publications

(19 citation statements)

References 38 publications

“…Multiple studies have shown that transfer learning can improve model accuracy by leveraging additional sources of related knowledge (e.g., from other paralinguistic tasks [136], various standard databases [137], and different affect representations [135]). SoundNet [138], a 1D CNN trained with unlabeled video, has been shown to perform well in SER even without fine-tuning [139], and was featured in a challenge-winning submission [17]. Semi-supervised learning can give access to knowledge contained in unlabeled datasets [140].…”

Section: Learning Spatial Features For SER

mentioning

confidence: 99%

Deep Learning for Human Affect Recognition: Insights and New Developments

Rouast, Adam, Chiong, 2021, IEEE Trans. Affective Comput.

158

Automatic human affect recognition is a key step towards more natural human-computer interaction. Recent trends include recognition in the wild using a fusion of audiovisual and physiological sensors, a challenging setting for conventional machine learning algorithms. Since 2010, novel deep learning algorithms have been applied increasingly in this field. In this paper, we review the literature on human affect recognition between 2010 and 2017, with a special focus on approaches using deep neural networks. By classifying a total of 950 studies according to their usage of shallow or deep architectures, we are able to show a trend towards deep learning. Reviewing a subset of 233 studies that employ deep neural networks, we comprehensively quantify their applications in this field. We find that deep learning is used for learning of (i) spatial feature representations, (ii) temporal feature representations, and (iii) joint feature representations for multimodal sensor data. Exemplary state-of-the-art architectures illustrate the progress. Our findings show the role deep architectures will play in human affect recognition, and can serve as a reference point for researchers working on related applications.

“…We are starting to see studies implementing end-to-end training for such models [54], [108], however in this setting the problem of limited labeled data becomes especially noticeable [63], [139].…”

mentioning

confidence: 99%

Deep Learning for Human Affect Recognition: Insights and New Developments

Rouast, Adam, Chiong, 2021, IEEE Trans. Affective Comput.

158

“…Instead of directly using C3D for classification, [109] employed C3D for spatio-temporal feature extraction and then cascaded with DBN for prediction. In [201], C3D was also used as a feature extractor, followed by a NetVLAD layer [202] to aggregate the temporal information of the motion features by learning cluster centers.…”

Section: RNN and C3D

mentioning

confidence: 99%

Deep Facial Expression Recognition: A Survey

Deng, 2022, IEEE Trans. Affective Comput.

1,082

593

With the transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions and the recent success of deep learning techniques in various fields, deep neural networks have increasingly been leveraged to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting caused by a lack of sufficient training data and expression-unrelated variations, such as illumination, head pose and identity bias. In this paper, we provide a comprehensive survey on deep FER, including datasets and algorithms that provide insights into these intrinsic problems. First, we introduce the available datasets that are widely used in the literature and provide accepted data selection and evaluation principles for these datasets. We then describe the standard pipeline of a deep FER system with the related background knowledge and suggestions of applicable implementations for each stage. For the state of the art in deep FER, we review existing novel deep neural networks and related training strategies that are designed for FER based on both static images and dynamic image sequences, and discuss their advantages and limitations. Competitive performances on widely used benchmarks are also summarized in this section. We then extend our survey to additional related issues and application scenarios. Finally, we review the remaining challenges and corresponding opportunities in this field as well as future directions for the design of robust deep FER systems.

“…In the FER task, a 3DCNN and its derived network structure have demonstrated excellent recognition effect [34]. Pini et al. applied C3D (convolutional 3D) as a feature extractor to obtain multichannel static and dynamic visual features and audio features, and they fused the network to extract spatial–temporal features [35]. Hasani et al. obtained the Hadamard product between a facial feature point vector and a feature vector in an inflated 3D convolution network (I3D) and cascaded an RNN to realize end-to-end network training [19].…”

Section: Related Work

mentioning

confidence: 99%

Hybrid Attention Cascade Network for Facial Expression Recognition

Zhu, Zhao, et al., 2021, Sensors

As a sub-challenge of EmotiW (the Emotion Recognition in the Wild challenge), how to improve performance on the AFEW (Acted Facial Expressions in the wild) dataset is a popular benchmark for emotion recognition tasks with various constraints, including uneven illumination, head deflection, and facial posture. In this paper, we propose a convenient facial expression recognition cascade network comprising spatial feature extraction, hybrid attention, and temporal feature extraction. First, in a video sequence, faces in each frame are detected, and the corresponding face ROI (range of interest) is extracted to obtain the face images. Then, the face images in each frame are aligned based on the position information of the facial feature points in the images. Second, the aligned face images are input to the residual neural network to extract the spatial features of facial expressions corresponding to the face images. The spatial features are input to the hybrid attention module to obtain the fusion features of facial expressions. Finally, the fusion features are input in the gate control loop unit to extract the temporal features of facial expressions. The temporal features are input to the fully connected layer to classify and recognize facial expressions. Experiments using the CK+ (the extended Cohn Kanade), Oulu-CASIA (Institute of Automation, Chinese Academy of Sciences) and AFEW datasets obtained recognition accuracy rates of 98.46%, 87.31%, and 53.44%, respectively. This demonstrated that the proposed method achieves not only competitive performance comparable to state-of-the-art methods but also greater than 2% performance improvement on the AFEW dataset, proving the significant outperformance of facial expression recognition in the natural environment.
