“…Compared to using a single modality, multi-modalities significantly improve the performance of learning models [13,14,15,16,17]. Several relevant surveys already exist, such as deep learning-based semantic segmentation [2,3,18,19], indoor scene understanding [20,21], multimodal perception for autonomous driving [22], multimodal human motion recognition [23], multimodal medical image segmentation [24], and multimodal learning study [25,26]. However, these review works are mostly focused on unimodal image segmentation, multimodal fusion for specific domains, or multimedia analysis across video, audio, and text.…”