The use of emotional states in Human-Robot Interaction (HRI) has attracted considerable attention in recent years. One of the most challenging tasks is recognizing spontaneous expressions of emotion, especially in an HRI scenario. Every person expresses emotions differently, and this variability is compounded by interaction with different subjects, multimodal information, and changing environments. We propose a deep neural model that handles these characteristics and apply it to the recognition of complex mental states. Our system learns deep spatial and temporal features and uses them to classify emotions in video sequences. We evaluate the system on the CAM3D corpus, which is composed of videos of different subjects recorded in different indoor environments. Each video shows the upper body of a subject expressing one of twelve complex mental states. Our system recognizes spontaneous complex mental states across subjects and can be deployed in such HRI scenarios.
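The abstract does not specify the architecture, but one common way to realize joint spatial and temporal feature learning over video is a per-frame convolutional encoder followed by a recurrent layer. The PyTorch sketch below illustrates that pattern under stated assumptions, not the authors' actual model: the class name, layer sizes, grayscale 64x64 input, and sequence length are all hypothetical; only the twelve-class output follows from the corpus description.

```python
import torch
import torch.nn as nn

class SpatioTemporalEmotionNet(nn.Module):
    """Illustrative CNN-per-frame encoder + LSTM over the frame sequence.

    All layer sizes are hypothetical; only num_classes=12 comes from the text.
    """
    def __init__(self, num_classes=12, hidden_size=128):
        super().__init__()
        # Spatial feature extractor, applied to each frame independently.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),  # -> 32 * 4 * 4 = 512 features per frame
        )
        # Temporal model over the sequence of per-frame features.
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.view(b * t, c, h, w)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        # Classify from the last hidden state: logits over 12 mental states.
        return self.classifier(h_n[-1])

# Example: a batch of 2 clips, 16 frames each, 64x64 grayscale.
logits = SpatioTemporalEmotionNet()(torch.randn(2, 16, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 12])
```

Splitting the model this way lets the convolutional stage capture per-frame appearance (the subject's upper-body expression) while the recurrent stage captures how the expression evolves over the sequence, which matches the spatial-plus-temporal decomposition the abstract describes.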