Video-level sentiment analysis is a challenging task that requires systems to obtain discriminative multimodal representations capable of capturing differences in sentiment across modalities. However, because the modalities follow diverse distributions and the unified multimodal labels are not always suitable for unimodal learning, the distance between unimodal representations grows, which prevents systems from learning discriminative multimodal representations. In this paper, to obtain more discriminative multimodal representations that can further improve system performance, we propose a VAE-based adversarial multimodal domain transfer (VAE-AMDT) method and jointly train it with a multi-attention module to reduce the distance between unimodal representations. We first apply a variational autoencoder (VAE) to make the visual, linguistic, and acoustic representations follow a common distribution, and then introduce adversarial training to transfer all unimodal representations into a joint embedding space. We then fuse the modalities in this joint embedding space via the multi-attention module, which consists of self-attention, cross-attention, and triple-attention to highlight important sentiment-related representations over time and across modalities. Our method improves the F1-score of the state of the art by 3.6% on the MOSI dataset and by 2.9% on the MOSEI dataset, demonstrating its efficacy in obtaining discriminative multimodal representations for video-level sentiment analysis.

INDEX TERMS: Multimodal representation learning, domain adaptation, variational auto-encoder (VAE), adversarial training.
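The following is a minimal sketch of the pipeline this abstract describes: per-modality VAEs map visual, linguistic, and acoustic features into a shared latent space, and a modality discriminator is trained adversarially so that the three latent distributions become indistinguishable before fusion. All module names, feature dimensions, and loss weightings here are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of VAE-based adversarial multimodal domain transfer.
# Dimensions (512 visual, 768 text, 74 audio) and architectures are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityVAE(nn.Module):
    """Encodes one modality into a shared latent space via the reparameterization trick."""
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.dec(z)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, F.mse_loss(recon, x) + kl

# Discriminator tries to identify which modality a latent vector came from;
# the encoders are trained to fool it (adversarial domain transfer).
disc = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 3))

vaes = nn.ModuleList([ModalityVAE(d) for d in (512, 768, 74)])  # visual / text / audio (assumed dims)
feats = [torch.randn(8, d) for d in (512, 768, 74)]             # dummy batch of 8 videos

latents, vae_loss = [], 0.0
for vae, x in zip(vaes, feats):
    z, loss = vae(x)
    latents.append(z)
    vae_loss = vae_loss + loss

z_all = torch.cat(latents, dim=0)                 # (24, 64) latents from all modalities
labels = torch.arange(3).repeat_interleave(8)     # modality label for each latent
d_loss = F.cross_entropy(disc(z_all), labels)     # discriminator: identify the modality
adv_loss = -d_loss                                # encoders: fool the discriminator (or use gradient reversal)
total_encoder_loss = vae_loss + adv_loss
print(f"VAE loss: {vae_loss.item():.3f}, discriminator loss: {d_loss.item():.3f}")
```

In a full system, the aligned latents would then be fused by the multi-attention module (self-, cross-, and triple-attention) mentioned in the abstract; that fusion step is omitted here for brevity.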
Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion. However, conventional systems are typically built on the assumption that all modalities are available, and missing modalities often lead to poor inference performance. Furthermore, extracting pretrained embeddings for all modalities is computationally inefficient at inference time. In this work, to achieve high efficiency-performance multimodal transfer learning, we propose VideoAdviser, a video knowledge distillation method that transfers multimodal knowledge of video-enhanced prompts from a multimodal fundamental model (teacher) to a specific-modal fundamental model (student). With the intuition that the best learning performance comes from professional advisers and smart students, we use a CLIP-based teacher model to provide expressive multimodal knowledge supervision signals to a RoBERTa-based student model by optimizing a step-distillation objective loss: in the first step, the teacher distills multimodal knowledge of video-enhanced prompts from classification logits to a regression logit; in the second step, the multimodal knowledge is distilled from the regression logit of the teacher to the student. We evaluate our method on two challenging multimodal tasks: video-level sentiment analysis (MOSI and MOSEI datasets) and audio-visual retrieval (VEGAS dataset). The student (requiring only the text modality as input) achieves an MAE score improvement of up to 12.3% on MOSI and MOSEI. Our method further improves the state-of-the-art method by a 3.4% mAP score on VEGAS without additional computation at inference time. These results demonstrate the strengths of our method for achieving high efficiency-performance multimodal transfer learning.

INDEX TERMS: Multimodal transfer learning, knowledge distillation, fundamental model.

YANAN WANG (Student Member, IEEE) received the B.S. degree in engineering from Aoyama Gakuin University in 2015 and the M.S. degree in engineering from The University of Electro-Communications, Japan, in 2017. He is currently pursuing the Ph.D. degree in engineering at Keio University, Japan. He is an Associate Research Engineer working on multimodal modeling topics at KDDI Research, Inc. His research interests include multimodal representation learning, emotion recognition, knowledge graphs, and graph representation learning. He is a Student Member of JSAI and a regular member of IEICE and the Editorial Committee of the IEICE Human Communication Group.
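As a rough illustration of the two-step distillation objective described in the abstract above, the sketch below collapses the teacher's classification logits into an expected sentiment score, pulls the teacher's regression logit toward it, and then distills that regression logit into the text-only student. The function name, tensor shapes, number of sentiment bins, and the use of MSE as the distance are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a step-distillation loss (classification logits -> teacher
# regression logit -> student regression logit). All names are hypothetical.
import torch
import torch.nn.functional as F

def step_distillation_loss(teacher_cls_logits, teacher_reg_logit, student_reg_logit, class_scores):
    """
    teacher_cls_logits: (B, C) classification logits from the CLIP-based teacher
    teacher_reg_logit:  (B, 1) regression output of the teacher
    student_reg_logit:  (B, 1) regression output of the RoBERTa-based student
    class_scores:       (C,)   sentiment score associated with each class bin
    """
    # Step 1: collapse the teacher's class distribution into an expected score
    # and pull the teacher's regression logit toward it.
    probs = F.softmax(teacher_cls_logits, dim=-1)                      # (B, C)
    expected_score = (probs * class_scores).sum(dim=-1, keepdim=True)  # (B, 1)
    step1 = F.mse_loss(teacher_reg_logit, expected_score)

    # Step 2: distill the teacher's regression logit into the student.
    step2 = F.mse_loss(student_reg_logit, teacher_reg_logit.detach())
    return step1 + step2

# Toy usage: batch of 4 videos, 7 sentiment bins spanning [-3, 3].
loss = step_distillation_loss(torch.randn(4, 7), torch.randn(4, 1),
                              torch.randn(4, 1), torch.linspace(-3, 3, 7))
print(loss.item())
```

Because the student consumes only the text modality, the video-enhanced teacher and its prompt extraction can be dropped entirely at inference time, which is where the efficiency gain claimed in the abstract comes from.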