Leveraging data from multiple modalities is a well-established approach to improving segmentation. Recent work has incorporated a range of additional modalities, including depth and thermal imaging. Nevertheless, effectively fusing cross-modal interactions remains challenging because each modality has distinct characteristics. In this work, we introduce the semantic guidance fusion network (SGFN), a cross-modal fusion network capable of integrating a diverse set of modalities. Specifically, the SGFN features a semantic guidance module (SGM) designed to enhance bi-modal feature extraction. It includes a learnable semantic guidance convolution (SGC) that merges intensity and gradient information from different modalities. Comprehensive experiments on the NYU Depth V2, SUN-RGBD, Cityscapes, MFNet, and ZJU datasets demonstrate the superior performance and generalization ability of the SGFN compared with current leading models. Moreover, on the DELIVER dataset, our bi-modal SGFN achieves an mIoU comparable to that of the previous leading model, CMNEXT.