Abstract-In recent years, considerable effort has been devoted to the automatic recognition of human emotions. On the one hand, there are many works based on speech processing; on the other, works that rely on facial expressions in still images. More recently, other modalities such as body gestures and biosignals have also begun to be used. In this work we present a multimodal system that processes audio-visual information, exploiting the prosodic features of the speech and the evolution of the facial expressions in video. The classification of each video into one of six emotions is carried out by deep networks, a neural network architecture consisting of several layers that capture high-order correlations between the features. The obtained results show the suitability of the proposed approach for this task, improving on the performance of standard multilayer perceptrons.
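For illustration only, the following is a minimal sketch of how fused audio-visual features could be passed through a several-layer classifier over six emotion classes. The feature dimensions, layer sizes, and use of PyTorch are assumptions, not the authors' implementation, and the sketch shows only the layered feed-forward form; the layer-wise pre-training that typically distinguishes deep networks from a standard multilayer perceptron is omitted.

```python
# Illustrative sketch only; dimensions, layer sizes, and the use of PyTorch
# are assumptions, not the implementation described in the paper.
import torch
import torch.nn as nn

N_PROSODIC = 64    # assumed size of the prosodic (speech) feature vector
N_FACIAL = 128     # assumed size of the facial-expression feature vector
N_EMOTIONS = 6     # six target emotion classes

# A layered feed-forward classifier: stacked layers intended to capture
# higher-order correlations between the fused audio-visual features.
model = nn.Sequential(
    nn.Linear(N_PROSODIC + N_FACIAL, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, N_EMOTIONS),   # logits for the six emotions
)

# Fuse the two modalities by simple concatenation of their feature vectors.
prosodic = torch.randn(1, N_PROSODIC)   # placeholder speech features
facial = torch.randn(1, N_FACIAL)       # placeholder facial features
logits = model(torch.cat([prosodic, facial], dim=1))
predicted_emotion = logits.argmax(dim=1)
```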