In this paper we present the techniques used for the University of Montréal's team submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions expressed by the primary human subject in short video clips extracted from feature length movies. This involves the analysis of video clips of acted scenes lasting approximately one-two seconds, including the audio track which may contain human voices as well as background music. Our approach combines multiple deep neural networks for different data modalities, including: (1) a deep convolutional neural network for the analysis of facial expressions within video frames; (2) a deep belief net to capture audio information; (3) a deep autoencoder to model the spatiotemporal information produced by the human actions depicted within the entire scene; and (4) a shallow network architecture focused on extracted features of the mouth of the primary human subject in the scene. We discuss each of these techniques, their performance characteristics and different strategies to aggregate their predictions. Our best single model was a convolutional neural network trained to predict emotions from static frames using two large data sets, the Toronto Face Database and our own set of faces images harvested from Google image search, followed by a per frame aggregation strategy that used the challenge training data. This yielded a test set accuracy of 35.58%. Using our best strategy for aggregating our top performing models into a single predictor we were able to produce an accuracy of 41.03% on the challenge test set. These compare favorably to the challenge baseline test set accuracy of 27.56%.
The task of the emotion recognition in the wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment.In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces, a deep belief net focusing on the representation of the audio stream, a K-Means based "bag-of-mouths" model, which extracts visual features around the mouth region and a relational autoencoder, which addresses spatio-temporal aspects of videos.We explore multiple methods for the combination of cues from these modalities into one common classifier. This achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67% on the 2014 dataset.
Abstract:We present an approach to predicting velocity and direction changes from visual information ("visual odometry") using an end-to-end, deep learning-based architecture. The architecture uses a single type of computational module and learning rule to extract visual motion, depth, and finally odometry information from the raw data. Representations of depth and motion are extracted by detecting synchrony across time and stereo channels using network layers with multiplicative interactions. The extracted representations are turned into information about changes in velocity and direction using a convolutional neural network. Preliminary results show that the architecture is capable of learning the resulting mapping from video to egomotion.
We developed a robust real time hand gesture based interaction system to effectively communicate with a mobile robot which can operate in an outdoor environment. The system enables the user to operate a mobile robot using hand gesture based commands. In particular the system offers direct on site interaction providing better perception of environment to the user. To overcome the illumination challenges in outdoors, the system operates on depth images. Processed depth images are given as input to a convolutional neural network which is trained to detect static hand gestures. The system is evaluated in real world experiments on a mobile robot to show the operational efficiency in outdoor environment
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.