Multimodal emotion detection (MED) in interactive conversations is extremely important for improving the overall human-computer interaction experience. Present research methods in this domain do not explicitly distinguish the contexts of a test utterance in a meaningful way while classifying emotions in conversations. In this paper, we propose a model, named different contextual window sizes based recurrent neural networks (DCWS-RNNs), to differentiate the contexts. The model has four recurrent neural networks (RNNs) that use different contextual window sizes. These window sizes can represent the implicit weights of different aspects of the context. Further, the four RNNs independently model the different aspects of the context into memories. These memories are then merged with the test utterance using attention-based multiple hops. Experiments show that DCWS-RNNs outperform the compared methods on both the IEMOCAP and AVEC datasets. Case studies on the IEMOCAP dataset also demonstrate that our model effectively captures the emotion-dependent utterance that is most relevant to the test utterance, assigning it the highest attention score.
INDEX TERMS Interactive conversations, contextual window sizes, emotion detection, multimodal, recurrent neural network.
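The following is a minimal sketch of the architecture described above: four RNNs, each encoding a different contextual window into a memory, followed by attention-based multiple hops that merge the memories with the test utterance. The GRU cells, the specific window sizes, the feature and hidden dimensions, the number of hops, and the six-class output are illustrative assumptions, not details given in the abstract.

```python
# Illustrative sketch of DCWS-RNNs; all hyperparameters below are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCWSRNNs(nn.Module):
    def __init__(self, feat_dim=100, hid_dim=100, windows=(5, 10, 20, 40), hops=3):
        super().__init__()
        self.windows = windows          # four contextual window sizes (assumed values)
        self.hops = hops                # number of attention hops (assumed value)
        # one independent RNN per contextual window size
        self.rnns = nn.ModuleList(
            [nn.GRU(feat_dim, hid_dim, batch_first=True) for _ in windows]
        )
        self.query_proj = nn.Linear(feat_dim, hid_dim)
        self.classifier = nn.Linear(hid_dim, 6)   # e.g., six IEMOCAP emotion classes

    def forward(self, context, test_utt):
        # context:  (batch, history_len, feat_dim) -- utterances preceding the test utterance
        # test_utt: (batch, feat_dim)              -- the utterance to classify
        query = self.query_proj(test_utt)          # initial query from the test utterance
        memories = []
        for rnn, w in zip(self.rnns, self.windows):
            # each RNN independently encodes only the last `w` context utterances
            states, _ = rnn(context[:, -w:, :])
            memories.append(states)
        memory = torch.cat(memories, dim=1)        # merged memories from the four RNNs
        for _ in range(self.hops):                 # attention-based multiple hops
            scores = torch.bmm(memory, query.unsqueeze(2)).squeeze(2)
            attn = F.softmax(scores, dim=1)        # attention scores over context memories
            read = torch.bmm(attn.unsqueeze(1), memory).squeeze(1)
            query = query + read                   # refine the query at each hop
        return self.classifier(query)              # emotion logits
```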