Emotion recognition in conversations (ERC) has recently received much attention in the natural language processing community. Because the emotions of utterances in a conversation interact with one another, previous works usually model this emotion interaction implicitly by modeling the dialogue context, but misleading emotion information in the context often interferes with the interaction. We observe that the gold emotion labels of the context utterances provide explicit and accurate emotion interaction, yet gold labels cannot be fed to the model at inference time. To address this problem, we propose an iterative emotion interaction network, which models the emotion interaction explicitly using iteratively predicted emotion labels in place of gold labels. This sidesteps the unavailability of gold labels at inference time while retaining the performance advantages of explicit modeling. We conduct experiments on two datasets, and our approach achieves state-of-the-art performance.
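The abstract above only outlines the iterative-label idea, so the following is a minimal PyTorch sketch of one way it could look: each utterance is re-classified for several iterations, and the labels predicted in the previous iteration are embedded and fed back as explicit emotion-interaction context. All names and hyperparameters (e.g. `EmotionInteractionNet`, `num_iters`) are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: iteratively feed predicted emotion labels back as context.
import torch
import torch.nn as nn

class EmotionInteractionNet(nn.Module):
    def __init__(self, utt_dim=768, num_emotions=7, label_dim=64, num_iters=3):
        super().__init__()
        self.num_iters = num_iters
        self.label_embed = nn.Embedding(num_emotions, label_dim)
        # Context encoder over [utterance feature ; embedded predicted label].
        self.context_rnn = nn.GRU(utt_dim + label_dim, utt_dim,
                                  batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * utt_dim, num_emotions)
        # Placeholder "label" feature for the first iteration (assumption).
        self.init_label = nn.Parameter(torch.zeros(label_dim))

    def forward(self, utt_feats):                   # (batch, num_utts, utt_dim)
        b, n, _ = utt_feats.shape
        label_feats = self.init_label.expand(b, n, -1)
        logits = None
        for _ in range(self.num_iters):
            # Explicit emotion interaction: condition each utterance on the
            # predicted emotion labels of its context, not only raw features.
            h, _ = self.context_rnn(torch.cat([utt_feats, label_feats], dim=-1))
            logits = self.classifier(h)             # (batch, num_utts, num_emotions)
            label_feats = self.label_embed(logits.argmax(dim=-1))
        return logits

# Usage with features from any utterance encoder: 2 dialogues, 10 utterances each.
feats = torch.randn(2, 10, 768)
print(EmotionInteractionNet()(feats).shape)         # torch.Size([2, 10, 7])
```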
Multimodal fusion is a core problem in multimodal sentiment analysis. Previous works usually treat the three modalities equally and explore the interactions between them only implicitly. In this paper, we depart from such methods in two ways. First, we observe that the textual modality plays the most important role in multimodal sentiment analysis, as previous works already suggest. Second, we observe that, compared with the textual modality, the two nontextual modalities (visual and acoustic) provide two kinds of semantics: shared and private. The shared semantics reinforce the textual semantics and make the sentiment analysis model more robust, while the private semantics complement the textual semantics and, together with the shared semantics, provide different views that further improve sentiment analysis. Motivated by these two observations, we propose a text-centered shared-private framework (TCSP) for multimodal fusion, which consists of a cross-modal prediction part and a sentiment regression part. Experiments on the MOSEI and MOSI datasets demonstrate the effectiveness of our shared-private framework, which outperforms all baselines. Furthermore, our approach provides a new way to utilize unlabeled data for multimodal sentiment analysis.
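To make the text-centered shared-private idea concrete, here is a minimal PyTorch sketch under stated assumptions: each nonverbal modality is projected into a component shared with the text and a private, complementary component, and the fused representation feeds a sentiment regressor. The projection heads, dimensions, and fusion choices are assumptions for illustration; the cross-modal prediction objective that the paper uses to learn the shared/private split is not reproduced here.

```python
# Sketch only: text-centered fusion of shared and private nonverbal semantics.
import torch
import torch.nn as nn

class TextCenteredFusion(nn.Module):
    def __init__(self, text_dim=768, vis_dim=35, aud_dim=74, hid=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hid)
        # Separate heads split each nonverbal modality into semantics shared
        # with the text and semantics private (complementary) to it.
        self.vis_shared = nn.Linear(vis_dim, hid)
        self.vis_private = nn.Linear(vis_dim, hid)
        self.aud_shared = nn.Linear(aud_dim, hid)
        self.aud_private = nn.Linear(aud_dim, hid)
        self.regressor = nn.Sequential(nn.Linear(3 * hid, hid), nn.ReLU(),
                                       nn.Linear(hid, 1))  # sentiment score

    def forward(self, text, vis, aud):               # utterance-level features
        t = self.text_proj(text)
        shared = self.vis_shared(vis) + self.aud_shared(aud)      # reinforces the text
        private = self.vis_private(vis) + self.aud_private(aud)   # complements the text
        fused = torch.cat([t + shared, private, t], dim=-1)
        return self.regressor(fused).squeeze(-1)

# Usage with a batch of 4 utterances (feature sizes are illustrative).
model = TextCenteredFusion()
score = model(torch.randn(4, 768), torch.randn(4, 35), torch.randn(4, 74))
print(score.shape)                                   # torch.Size([4])
```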