Characterizing the structure of tissues from spatially resolved transcriptomics (SRT) data is of great value for deciphering the functionality of spatial domains. However, the inherent heterogeneity and varying spatial resolutions of the different modalities in SRT data make their joint analysis challenging. In this study, we introduce a multi-modal feature representation method, named stMMR, that integrates gene expression, spatial location and histological imaging information to accurately identify spatial domains from SRT data. stMMR uses a self-attention module for deep embedding of features within each modality and incorporates similarity contrastive learning to integrate features across modalities. stMMR demonstrates superior performance in multiple analyses on datasets generated by different platforms, including spatial domain identification, pseudo-spatiotemporal analysis and domain-specific gene discovery. Using stMMR, we systematically analyzed the evolving lineage structures of the chicken heart and conducted an in-depth examination of domain-specific genes in breast cancer and lung cancer. In conclusion, stMMR effectively integrates multi-modal information for spatial domain identification and exhibits superior adaptability and stability in handling different types of SRT data.
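
The two ingredients named above, self-attention within a modality and contrastive learning across modalities, can be illustrated with a minimal NumPy sketch. This is not the stMMR implementation: the function names, the random toy data, the embedding size, and the temperature are all assumptions made for illustration, and a generic InfoNCE-style loss stands in for the similarity contrastive objective, whose exact form is not given in this text. Spatial location is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention over the spots of ONE modality.
    # X: (n_spots, d_in); Wq, Wk, Wv: (d_in, d_emb) projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores, axis=1) @ V

def contrastive_loss(Za, Zb, temperature=0.5):
    # Generic InfoNCE-style loss (an assumption, not the paper's objective):
    # row i of Za and row i of Zb describe the same spot and form the
    # positive pair; all other rows of Zb act as negatives.
    Za = Za / np.linalg.norm(Za, axis=1, keepdims=True)
    Zb = Zb / np.linalg.norm(Zb, axis=1, keepdims=True)
    logits = Za @ Zb.T / temperature          # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Hypothetical toy data: 6 spots, 10 gene features, 8 image features.
rng = np.random.default_rng(0)
n_spots, d_gene, d_img, d_emb = 6, 10, 8, 4
genes = rng.normal(size=(n_spots, d_gene))
image = rng.normal(size=(n_spots, d_img))

# One set of (randomly initialised, untrained) projections per modality.
gene_weights = [rng.normal(size=(d_gene, d_emb)) for _ in range(3)]
img_weights = [rng.normal(size=(d_img, d_emb)) for _ in range(3)]

z_gene = self_attention(genes, *gene_weights)   # (n_spots, d_emb)
z_img = self_attention(image, *img_weights)     # (n_spots, d_emb)
loss = contrastive_loss(z_gene, z_img)
```

In a trained model the projection matrices would be learned by minimizing such a loss, which pulls the two embeddings of the same spot together and pushes embeddings of different spots apart; here the weights are random, so the loss value itself carries no meaning.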