Proceedings of the 19th ACM International Conference on Multimodal Interaction 2017
DOI: 10.1145/3136755.3143006

Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild

Abstract: In this paper, we propose a multimodal deep learning architecture for emotion recognition in video regarding our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract …
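As a rough illustration of the three-stream design the abstract outlines, below is a minimal PyTorch sketch of late fusion over separately trained sub-networks for static facial appearance, motion, and audio. The module names, feature dimensions, and score-averaging fusion are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of three-stream late fusion for video emotion
# recognition: one branch for static facial appearance, one for facial
# motion, one for audio. Dimensions and score averaging are assumptions.
import torch
import torch.nn as nn

NUM_EMOTIONS = 7  # e.g., the seven emotion classes used in EmotiW

class Branch(nn.Module):
    """Stand-in for a separately trained sub-network ending in logits."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, NUM_EMOTIONS),
        )

    def forward(self, x):
        return self.net(x)

class LateFusionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.face_branch = Branch(in_dim=512)     # static facial features
        self.motion_branch = Branch(in_dim=4096)  # temporal/motion features
        self.audio_branch = Branch(in_dim=1024)   # audio features

    def forward(self, face_feat, motion_feat, audio_feat):
        # Average the per-branch class scores (one simple fusion choice).
        return (self.face_branch(face_feat)
                + self.motion_branch(motion_feat)
                + self.audio_branch(audio_feat)) / 3.0

model = LateFusionModel()
scores = model(torch.randn(2, 512), torch.randn(2, 4096), torch.randn(2, 1024))
print(scores.shape)  # torch.Size([2, 7])
```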

Cited by 38 publications (19 citation statements)
References 38 publications
“…Multiple studies have shown that transfer learning can improve model accuracy by leveraging additional sources of related knowledge (e.g., from other paralinguistic tasks [136], various standard databases [137], and different affect representations [135]). SoundNet [138], a 1D CNN trained with unlabeled video, has been shown to perform well in SER even without fine-tuning [139], and was featured in a challenge-winning submission [17]. Semi-supervised learning can give access to knowledge contained in unlabeled datasets [140].…”
Section: Learning Spatial Features for SER (mentioning)
confidence: 99%
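To make the transfer-learning pattern in this excerpt concrete, the sketch below freezes a pretrained 1D-CNN audio backbone (a hypothetical stand-in for SoundNet, which is not a PyTorch API) and trains only a small classifier head on labeled SER data.

```python
# Hedged sketch of transfer learning for speech emotion recognition (SER):
# a pretrained 1D-CNN audio backbone (stand-in for SoundNet) is frozen and
# only a lightweight classifier head is trained on the labeled SER data.
import torch
import torch.nn as nn

class PretrainedAudioNet(nn.Module):
    """Hypothetical 1D-CNN backbone; in practice, load pretrained weights."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed-size feature
        )

    def forward(self, waveform):             # waveform: (batch, 1, samples)
        return self.conv(waveform).squeeze(-1)  # (batch, 32)

backbone = PretrainedAudioNet()
backbone.requires_grad_(False)  # freeze: reuse knowledge, no fine-tuning
head = nn.Linear(32, 7)         # only these parameters are trained

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
x = torch.randn(4, 1, 16000)    # e.g., 1 s of 16 kHz audio
y = torch.randint(0, 7, (4,))
logits = head(backbone(x))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
optimizer.step()
```

Freezing the backbone keeps the pretrained representation intact and makes training feasible on the small labeled corpora typical of SER.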
“…We are starting to see studies implementing end-to-end training for such models [54], [108]; however, in this setting the problem of limited labeled data becomes especially noticeable [63], [139].…”
(mentioning)
confidence: 99%
“…Instead of directly using C3D for classification, [109] employed C3D for spatio-temporal feature extraction and then cascaded it with a DBN for prediction. In [201], C3D was also used as a feature extractor, followed by a NetVLAD layer [202] that aggregates the temporal information of the motion features by learning cluster centers.…”
Section: RNN and C3D (mentioning)
confidence: 99%
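The NetVLAD aggregation described here can be sketched compactly: frame-level descriptors are softly assigned to learned cluster centers, and the residuals to those centers are summed and normalized into one fixed-size video descriptor. The PyTorch rendition below is simplified (no convolutional assignment over spatial maps); cluster count and feature dimension are illustrative.

```python
# Simplified NetVLAD layer: aggregates a variable-length set of frame
# descriptors into a single fixed-size video descriptor by summing
# soft-assigned residuals to learned cluster centers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, num_clusters: int = 8, dim: int = 512):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim) * 0.01)
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment weights

    def forward(self, x):
        # x: (batch, num_frames, dim) frame-level motion features
        soft = F.softmax(self.assign(x), dim=-1)        # (B, T, K)
        # residual of each descriptor to each center:    (B, T, K, D)
        resid = x.unsqueeze(2) - self.centers.unsqueeze(0).unsqueeze(0)
        vlad = (soft.unsqueeze(-1) * resid).sum(dim=1)  # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                # intra-normalization
        vlad = F.normalize(vlad.flatten(1), dim=-1)     # final L2 norm
        return vlad                                     # (B, K * D)

features = torch.randn(2, 16, 512)   # e.g., 16 C3D clip features per video
video_descriptor = NetVLAD()(features)
print(video_descriptor.shape)        # torch.Size([2, 4096])
```

Soft assignment keeps the aggregation differentiable, so the cluster centers can be learned jointly with the rest of the network.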
“…In the FER task, a 3DCNN and its derived network structures have demonstrated excellent recognition performance [34]. Pini et al. applied C3D (convolutional 3D) as a feature extractor to obtain multichannel static and dynamic visual features as well as audio features, and fused the networks to extract spatio-temporal features [35]. Hasani et al. obtained the Hadamard product between a facial feature point vector and a feature vector in an inflated 3D convolution network (I3D) and cascaded an RNN to realize end-to-end network training [19].…”
Section: Related Work (mentioning)
confidence: 99%
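A minimal sketch of the Hadamard-product fusion with a cascaded RNN attributed here to Hasani et al.: facial landmark coordinates are projected into the feature space, multiplied element-wise with per-clip 3D-convolutional features, and the fused sequence is classified with a GRU. All dimensions and the GRU choice are assumptions for illustration.

```python
# Hedged sketch of Hadamard-product fusion followed by an RNN: per-clip
# spatio-temporal features are multiplied element-wise by a vector projected
# from facial landmarks, then the sequence is classified with a GRU.
import torch
import torch.nn as nn

class HadamardFusionRNN(nn.Module):
    def __init__(self, feat_dim=512, landmark_dim=136, num_classes=7):
        super().__init__()
        # Project 68 (x, y) landmarks into the feature space for fusion.
        self.landmark_proj = nn.Linear(landmark_dim, feat_dim)
        self.rnn = nn.GRU(feat_dim, 256, batch_first=True)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, clip_feats, landmarks):
        # clip_feats: (B, T, feat_dim) features from a 3D conv network
        # landmarks:  (B, T, landmark_dim) facial landmark coordinates
        lm_vec = self.landmark_proj(landmarks)   # (B, T, feat_dim)
        fused = clip_feats * lm_vec              # Hadamard (element-wise) product
        _, h_n = self.rnn(fused)                 # h_n: (1, B, 256)
        return self.classifier(h_n[-1])          # (B, num_classes)

model = HadamardFusionRNN()
logits = model(torch.randn(2, 10, 512), torch.randn(2, 10, 136))
print(logits.shape)  # torch.Size([2, 7])
```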