The recent spread of low-cost and high-quality RGB-D and infrared sensors has supported the development of Natural User Interfaces (NUIs) in which the interaction is carried without the use of physical devices such as keyboards and mouse. In this paper, we propose a NUI based on dynamic hand gestures, acquired with RGB, depth and infrared sensors. The system is developed for the challenging automotive context, aiming at reducing the driver’s distraction during the driving activity. Specifically, the proposed framework is based on a multimodal combination of Convolutional Neural Networks whose input is represented by depth and infrared images, achieving a good level of light invariance, a key element in vision-based in-car systems. We test our system on a recent multimodal dataset collected in a realistic automotive setting, placing the sensors in an innovative point of view, i.e., in the tunnel console looking upwards. The dataset consists of a great amount of labelled frames containing 12 dynamic gestures performed by multiple subjects, making it suitable for deep learning-based approaches. In addition, we test the system on a different well-known public dataset, created for the interaction between the driver and the car. Experimental results on both datasets reveal the efficacy and the real-time performance of the proposed method.
In this paper, we propose a multimodal deep learning architecture for emotion recognition in video regarding our participation to the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNN), while the third one consists in a pretrained audio network which is used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities in one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% and 49.92% respectively on the validation and the testing data. CCS CONCEPTS • Computing methodologies → Activity recognition and understanding; Neural networks;
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.