The accurate recognition of emotions in conversations helps understand the speaker’s intentions and facilitates various analyses in artificial intelligence, especially in human–computer interaction systems. However, most previous methods lack the ability to track the different emotional states of each speaker in a dialogue. To alleviate this dilemma, we propose a new approach, Multi-Task Learning and Multi-Fusion AudioText Emotion Recognition in Conversation (MMATERIC). MMATERIC combines the benefits of two distinct tasks, emotion recognition in text and emotion recognition in speech, and produces fused multimodal features to recognize the emotions of different speakers in a dialogue. At the core of MMATERIC are three modules: an encoder with multimodal attention, a speaker emotion detection unit (SED-Unit), and a decoder with speaker emotion detection Bi-LSTM (SED-Bi-LSTM). Together, these three modules model the changing emotions of a speaker at a given moment in a conversation. We also adopt multiple fusion strategies at different stages, mainly model-stage fusion and decision-stage fusion, to improve the model’s accuracy. In addition, our multimodal framework allows features to interact across modalities and allows potential adaptation to flow from one modality to another. Our experimental results on two benchmark datasets show that our proposed method is effective and outperforms the state-of-the-art baseline methods. The performance improvement of our method is mainly attributed to the combination of the three core modules of MMATERIC and the different fusion methods we adopt at each stage.
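To make the fusion pipeline concrete, the sketch below shows one plausible way to realize the ideas in this abstract: cross-modal attention between text and audio utterance features (model-stage fusion), a Bi-LSTM over the dialogue to track a speaker's changing emotional state, and an average of per-modality and fused logits (decision-stage fusion). This is a minimal illustration, not the authors' implementation; all module names, dimensions, and the averaging scheme are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusionSketch(nn.Module):
    """Hypothetical audio-text fusion model inspired by the abstract above."""
    def __init__(self, dim=256, n_heads=4, n_classes=6):
        super().__init__()
        # Model-stage fusion: each modality attends to the other.
        self.text_to_audio = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Bi-LSTM tracks the speaker's emotional state across the dialogue.
        self.dialogue_lstm = nn.LSTM(2 * dim, dim, bidirectional=True, batch_first=True)
        # Separate heads for text, audio, and fused features.
        self.text_head = nn.Linear(dim, n_classes)
        self.audio_head = nn.Linear(dim, n_classes)
        self.fused_head = nn.Linear(2 * dim, n_classes)

    def forward(self, text_feats, audio_feats):
        # text_feats, audio_feats: (batch, n_utterances, dim)
        t_attn, _ = self.text_to_audio(text_feats, audio_feats, audio_feats)
        a_attn, _ = self.audio_to_text(audio_feats, text_feats, text_feats)
        fused, _ = self.dialogue_lstm(torch.cat([t_attn, a_attn], dim=-1))
        # Decision-stage fusion: average the per-modality and fused logits.
        return (self.text_head(t_attn) +
                self.audio_head(a_attn) +
                self.fused_head(fused)) / 3.0

# Example: 2 dialogues, 5 utterances each, 256-dim features per modality.
model = CrossModalFusionSketch()
logits = model(torch.randn(2, 5, 256), torch.randn(2, 5, 256))
print(logits.shape)  # torch.Size([2, 5, 6])
```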
In Knowledge-Grounded Dialogue (KGD) generation, explicitly modeling how knowledge specificity varies across instances, and seamlessly fusing that knowledge with the dialogue context, remains challenging. This paper presents an innovative approach, the Knowledge Interpolated conditional Variational auto-encoder (KIV), to address these issues. In particular, KIV introduces a novel interpolation mechanism to fuse two latent variables that independently encode the dialogue context and the grounded knowledge. This distinct fusion of context and knowledge in the semantic space enables the interpolated latent variable to guide the decoder toward generating more contextually rich and engaging responses. We further explore deterministic and probabilistic methodologies to ascertain the interpolation weight, capturing the level of knowledge specificity. Comprehensive empirical analysis conducted on the Wizard-of-Wikipedia and Holl-E datasets verifies that the responses generated by our model outperform those of strong baselines, with notable improvements observed in both automatic metrics and manual evaluation.
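One simple way to picture the interpolation mechanism is a convex combination of the two latents, z = α·z_knowledge + (1 − α)·z_context, where α reflects how knowledge-specific the response should be. The sketch below illustrates a deterministic variant in which α is predicted from both latents; it is an illustrative assumption, not the authors' code, and the gating network and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class LatentInterpolationSketch(nn.Module):
    """Hypothetical interpolation of a context latent and a knowledge latent."""
    def __init__(self, latent_dim=64):
        super().__init__()
        # Deterministic variant: the weight alpha is predicted from both latents.
        self.alpha_net = nn.Sequential(
            nn.Linear(2 * latent_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, z_context, z_knowledge):
        # alpha in (0, 1) captures the level of knowledge specificity.
        alpha = self.alpha_net(torch.cat([z_context, z_knowledge], dim=-1))
        return alpha * z_knowledge + (1.0 - alpha) * z_context

z_c = torch.randn(8, 64)   # latent sampled from the context encoder
z_k = torch.randn(8, 64)   # latent sampled from the knowledge encoder
print(LatentInterpolationSketch()(z_c, z_k).shape)  # torch.Size([8, 64])
```

A probabilistic variant could instead treat α as a random variable (e.g. drawn from a Beta distribution whose parameters are predicted by a similar network) and sample it during generation.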