Human emotion judgments typically draw on multiple modalities, including language, audio, facial expressions, and gestures. Because each modality encodes information differently, multimodal data exhibit both redundancy and complementarity, so an effective multimodal fusion strategy is essential for improving the accuracy of sentiment analysis. Inspired by the Crossmodal Transformer used for multimodal fusion in the MulT (Multimodal Transformer) model, this paper inserts Crossmodal Transformer blocks into the fusion stage of the MISA (Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis) model to enhance the representation of each modality with information from the others, and proposes three MISA-CT models. Evaluated on two publicly available multimodal sentiment analysis datasets, MOSI and MOSEI, the proposed models outperform the original MISA model.
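To make the fusion idea concrete, the following is a minimal PyTorch sketch of a MulT-style cross-modal attention block, in which one modality's sequence (e.g. text) queries another (e.g. audio or vision) to produce an enhanced representation. The class name, dimensions, and the way the enhanced streams are combined before a prediction head are illustrative assumptions, not the authors' exact MISA-CT implementation.

```python
# Hypothetical sketch of cross-modal (MulT-style) attention for modality
# enhancement; details are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class CrossmodalBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, ff_mult=4, dropout=0.1):
        super().__init__()
        # Target modality supplies queries; source modality supplies keys/values.
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, ff_mult * d_model),
            nn.ReLU(),
            nn.Linear(ff_mult * d_model, d_model),
        )

    def forward(self, target, source):
        # Cross-attention: target attends to source, then residual + feed-forward.
        attended, _ = self.attn(self.norm1(target), source, source)
        x = target + attended
        return x + self.ff(self.norm2(x))


# Example: enhance the text sequence with information from audio and vision,
# then concatenate the enhanced streams before a downstream fusion/prediction head.
text = torch.randn(8, 50, 128)    # (batch, seq_len, d_model)
audio = torch.randn(8, 50, 128)
video = torch.randn(8, 50, 128)

t_from_a = CrossmodalBlock()(text, audio)   # text enhanced by audio
t_from_v = CrossmodalBlock()(text, video)   # text enhanced by vision
fused_text = torch.cat([t_from_a, t_from_v], dim=-1)
```

In MulT, analogous blocks are applied in both directions for each modality pair; a MISA-CT-style variant would place such blocks in the fusion stage after MISA's modality-invariant and modality-specific encoders.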