“…However, most previous work has focused on the unimodal setting, where all clients in the federated system hold the same data modality, as shown in Figure 1 (left). Among these studies, statistical heterogeneity [3], i.e., the non-IID challenge caused by skew in labels, features, and data quantity among clients, is one of the most critical challenges and has attracted much attention [4][5][6][7][8]. In contrast, multimodal federated learning, as shown in Figure 1 (right), further introduces the modality heterogeneity challenge, which leads to significant differences in model structures, local tasks, and parameter spaces among clients, thereby exposing the substantial limitations of traditional unimodal algorithms.…”