2015
DOI: 10.1016/j.neunet.2015.09.009
Multimodal emotional state recognition using sequence-dependent deep hierarchical features

Abstract: Emotional state recognition has become an important topic for human-robot interaction in recent years. By determining emotion expressions, robots can identify important variables of human behavior and use these to communicate in a more human-like fashion, thereby extending the interaction possibilities. Human emotions are multimodal and spontaneous, which makes them hard for robots to recognize. Each modality has its own restrictions and constraints which, together with the non-structured behavior of spon…

Cited by 67 publications (30 citation statements) · References 31 publications
“…When using 3D CNN for spatio-temporal modeling of image sequences as discussed in Section 3.1.1, the line between spatial and temporal representation learning can be blurred. While this approach is typically limited to very short sequences, with further pooling steps necessary to derive sequence-level labels (e.g., [84], [85]), in some cases spatio-temporal features can be derived for entire (short) sequences. For example, Gupta et al. [62] used a variant called slow fusion [153], which treats the time domain like a spatial domain, progressively learning low-level to high-level temporal features.…”
Section: Learning Temporal Features For FER
confidence: 99%
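As a rough illustration of the slow-fusion idea described above, the numpy sketch below fuses adjacent time steps stage by stage, so the temporal receptive field grows progressively (1 frame, then 2, 4, 8). The per-frame 4-D feature vectors and the uniform fusion weights are made-up stand-ins for illustration, not the actual architecture of [62] or [153]:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_conv(x, kernel, stride=2):
    """Strided 1-D temporal convolution over axis 0 of x (T, F) -> (T', F)."""
    k = kernel.shape[0]
    T = x.shape[0]
    out = [np.tanh(np.einsum('tf,t->f', x[t:t + k], kernel))
           for t in range(0, T - k + 1, stride)]
    return np.stack(out)

# Toy slow-fusion tower over 8 frames, each already reduced to a 4-D
# feature vector (standing in for per-frame CNN features). Each stage
# fuses pairs of adjacent time steps, so later stages see progressively
# longer temporal context -- low-level to high-level temporal features.
frames = rng.standard_normal((8, 4))
stage1 = temporal_conv(frames, np.array([0.5, 0.5]))  # (4, 4), sees 2 frames
stage2 = temporal_conv(stage1, np.array([0.5, 0.5]))  # (2, 4), sees 4 frames
stage3 = temporal_conv(stage2, np.array([0.5, 0.5]))  # (1, 4), sees all 8
```

The key contrast with early fusion (stacking all frames at the input) is that here the time axis is collapsed gradually, like spatial resolution in a conventional CNN.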
“…To handle multimodal data, our network uses the concept of the CCCNN by Barros, Jirak, Weber, and Wermter (2015a). In the CCCNN architecture, several channels, each composed of an independent sequence of convolution and pooling layers, are fully connected at the end to a cross-channel layer, which is itself composed of convolution and pooling layers, and the whole network is trained as one single architecture.…”
Section: Emotion Expression Representation
confidence: 99%
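A minimal numpy sketch of the cross-channel idea: two independent conv-plus-pool streams whose pooled maps are merged and fused by a shared convolution. The input sizes, random kernels, and the simple additive merge are illustrative assumptions only, not the CCCNN's actual layer configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv2d_valid(img, kern):
    """Naive single-channel 'valid' 2-D convolution followed by ReLU."""
    kh, kw = kern.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return np.maximum(out, 0.0)

def maxpool2(x):
    """2x2 max pooling (trailing odd row/column is dropped)."""
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

# Two modality channels (e.g. a face crop and an upper-body crop), each
# with its own conv + pool stream; the pooled maps are then merged and
# passed through a cross-channel convolution, so the fused feature is
# learned jointly -- the whole stack would be trained end to end.
face = rng.standard_normal((12, 12))
body = rng.standard_normal((12, 12))
k_face, k_body, k_cross = (rng.standard_normal((3, 3)) for _ in range(3))

feat_face = maxpool2(conv2d_valid(face, k_face))   # (5, 5)
feat_body = maxpool2(conv2d_valid(body, k_body))   # (5, 5)
cross_in = feat_face + feat_body                   # simple channel merge
fused = maxpool2(conv2d_valid(cross_in, k_cross))  # (1, 1) joint feature
```

The point of the cross-channel layer is that fusion happens inside the trained network rather than by concatenating independently trained per-modality features afterwards.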
“…Chen et al. [129] used HOG on the motion history image (MHI) to capture motion direction and speed, and Image-HOG features in a bag-of-words (BOW) framework to compute appearance features. Another example is the use of a multichannel CNN to learn a deep representation from the upper part of the body [130]. Finally, Botzheim et al. [131] used spiking neural networks for temporal coding.…”
Section: Representation Learning
confidence: 99%
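The motion history image mentioned above encodes where and how recently motion occurred: recently moving pixels get a high value that decays over time, so intensity gradients indicate motion direction and speed. A minimal numpy sketch, with a made-up toy sequence and arbitrary `tau` and threshold values:

```python
import numpy as np

def motion_history_image(frames, tau=10, diff_thresh=0.1):
    """Compute a motion history image from a grayscale frame sequence.

    frames: array of shape (T, H, W) with values in [0, 1]. Pixels that
    changed in the most recent frame pair are set to tau; older motion
    decays linearly toward zero, one unit per frame.
    """
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr - prev) > diff_thresh
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi

# Toy example: a bright square moving one pixel right per frame, which
# leaves a left-to-right brightness ramp in the resulting MHI.
T, H, W = 5, 16, 16
frames = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    frames[t, 4:8, 4 + t:8 + t] = 1.0
mhi = motion_history_image(frames)
```

Descriptors such as HOG computed on this single image then summarize the motion of the whole clip without any explicit temporal model.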
“…2) for learning such representations in a supervised way. Two of the very few works that use deep learning representations for body emotion recognition are the multichannel CNN applied to the upper body [130] and spiking neural networks for temporal coding [131]. As previously discussed in Sec.…”
Section: Representation Learning and Emotion Recognition
confidence: 99%