Spatially resolved transcriptomics simultaneously measures the spatial locations, histology images, and transcriptional profiles of the same cells or regions in undissociated tissues. Integrative analysis of multi-modal spatially resolved data holds immense potential for understanding biological mechanisms. Here we present MuCST, a flexible multi-modal contrastive learning framework for the integration of spatially resolved transcriptomics, which jointly performs denoising, elimination of heterogeneity, and compatible feature learning. We demonstrate that MuCST robustly and accurately identifies tissue subpopulations from simulated data under various types of perturbations. In cancer-related tissues, MuCST precisely identifies tumor-associated domains, reveals gene biomarkers for tumor regions, and exposes intra-tumoral heterogeneity. We also validate that MuCST is applicable to datasets generated from diverse platforms, such as STARmap, Visium, and osmFISH for spatial transcriptomics, and hematoxylin-and-eosin or fluorescence microscopy for images. Overall, MuCST not only facilitates the integration of multi-modal spatially resolved data, but also serves as a pre-processing step for data restoration (Python software is available at https://github.com/xkmaxidian/MuCST).
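To make the notion of "multi-modal contrastive learning" concrete, the following is a minimal sketch of a symmetric cross-modal InfoNCE objective, the standard contrastive loss used to align embeddings from two modalities (here, hypothetical image and expression embeddings per spot). The function names, shapes, and temperature value are illustrative assumptions, not MuCST's actual implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project each row onto the unit sphere so dot products equal cosine similarity.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cross_modal_infonce(z_img, z_expr, temperature=0.1):
    """Symmetric InfoNCE loss between paired embeddings (illustrative sketch).

    z_img, z_expr: (n_spots, d) arrays; row i of each matrix comes from the
    same spot, forming a positive pair, while all other rows act as negatives.
    """
    z_img = l2_normalize(z_img)
    z_expr = l2_normalize(z_expr)
    logits = z_img @ z_expr.T / temperature  # (n, n) cosine-similarity matrix
    n = logits.shape[0]

    def nll_of_diagonal(mat):
        # Numerically stable log-softmax over each row, then take the
        # negative log-probability assigned to the matching (diagonal) pair.
        mat = mat - mat.max(axis=1, keepdims=True)
        log_prob = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(n), np.arange(n)].mean()

    # Average the image-to-expression and expression-to-image directions.
    return 0.5 * (nll_of_diagonal(logits) + nll_of_diagonal(logits.T))
```

Correctly paired embeddings should yield a lower loss than mismatched ones, which is what drives the two modalities toward a shared, compatible feature space.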