“…Multimodal learning leverages heterogeneous and comprehensive signals, such as acoustic, visual, and lexical information, to perform typical machine learning tasks, for instance clustering, regression, classification, and retrieval (Sun, Dong, and Liu 2021; Zhang et al. 2022; Han et al. 2022). Relative to its unimodal counterpart, multimodal learning has demonstrated great success in numerous applications, including but not limited to medical analysis (Liu et al. 2023), action recognition (Woo et al. 2023), and affective computing (Sun et al. 2023). Nevertheless, multimodal learning is inevitably faced with the missing-modality problem due to malfunctioning sensors, high data acquisition costs, privacy concerns, etc.…”