Multimodal emotion detection (MED) in interactive conversations is crucial for improving the overall human-computer interaction experience. Existing methods in this domain do not explicitly distinguish the contexts of a test utterance in a meaningful way when classifying emotions in conversations. In this paper, we propose a model, named different contextual window sizes based recurrent neural networks (DCWS-RNNs), to differentiate the contexts. The model has four recurrent neural networks (RNNs) that use different contextual window sizes; these window sizes can represent the implicit weights of different aspects of context. The four RNNs independently encode the different aspects of context into memories. These memories are then merged with the test utterance using attention-based multiple hops. Experiments show that DCWS-RNNs outperforms the compared methods on both the IEMOCAP and AVEC datasets. Case studies on the IEMOCAP dataset also demonstrate that our model effectively captures the emotionally dependent utterance that is most relevant to the test utterance and assigns it the highest attention score.

INDEX TERMS Interactive conversations, contextual window sizes, emotion detection, multimodal, recurrent neural network.
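To make the architecture described above concrete, the following PyTorch-style sketch shows one possible reading of it: four GRU encoders, each restricted to a different contextual window of the preceding utterances, produce memories that are fused with the test utterance through attention-based multiple hops before classification. The window sizes (3, 6, 12, full history), the use of GRUs, the hidden dimension, and the exact hop/merging wiring are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a DCWS-RNN-style model (assumed hyperparameters and
# fusion scheme; NOT the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DCWSRNNSketch(nn.Module):
    def __init__(self, feat_dim=100, hidden_dim=128, num_classes=6,
                 window_sizes=(3, 6, 12, None), hops=3):
        super().__init__()
        self.window_sizes = window_sizes          # None = full conversation history
        self.hops = hops
        # one context encoder (RNN) per contextual window size
        self.encoders = nn.ModuleList(
            [nn.GRU(feat_dim, hidden_dim, batch_first=True) for _ in window_sizes])
        self.query_proj = nn.Linear(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, context, test_utt):
        # context: (batch, T, feat_dim) fused multimodal features of preceding utterances
        # test_utt: (batch, feat_dim) fused multimodal features of the test utterance
        query = self.query_proj(test_utt)                        # (batch, hidden)
        for encoder, w in zip(self.encoders, self.window_sizes):
            ctx = context if w is None else context[:, -w:, :]   # crop to the window
            memory, _ = encoder(ctx)                             # (batch, w, hidden)
            # attention-based multiple hops: repeatedly read this memory
            # and refine the query with what was read
            for _ in range(self.hops):
                scores = torch.bmm(memory, query.unsqueeze(2)).squeeze(2)
                attn = F.softmax(scores, dim=1)                  # (batch, w)
                read = torch.bmm(attn.unsqueeze(1), memory).squeeze(1)
                query = query + read                             # updated query
        return self.classifier(query)                            # emotion logits


# Toy usage: batch of 8 conversations, 20 context utterances, 100-d features.
model = DCWSRNNSketch()
logits = model(torch.randn(8, 20, 100), torch.randn(8, 100))
print(logits.shape)  # torch.Size([8, 6])
```

In this sketch the attention scores over each memory play the role described in the case studies: the contextual utterance most relevant to the test utterance receives the highest attention weight and therefore contributes most to the refined query.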