Recent advances in spatial transcriptomics have enabled measurement of gene expression at cell or spot resolution while retaining both the spatial information and the histopathological images of the tissues. Deciphering the spatial domains of spots in the tissues is a vital step for various downstream tasks in spatial transcriptomics analysis. Existing methods have been developed for this purpose by combining gene expression and histopathological images to mitigate the noise in gene expression. However, current methods either use the histopathological images only to construct spot relations that are not updated during training, or simply concatenate the information from gene expression and images into a single feature vector. Here, we propose a novel method, ConGI, to accurately decipher spatial domains by integrating gene expression and histopathological images, in which gene expression is adapted to image information through contrastive learning. We introduce three contrastive loss functions within and between modalities to learn the common semantic representations across the two modalities while avoiding meaningless modality-private noise. The learned representations are then used to decipher spatial domains through a clustering method. In comprehensive tests on tumor and normal spatial transcriptomics datasets, ConGI outperformed existing methods in spatial domain identification. More importantly, the representations learned by our model can also be used efficiently for various downstream tasks, including trajectory inference, clustering, and visualization.
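
To make the cross-modal contrastive objective mentioned above more concrete, the sketch below shows one plausible InfoNCE-style formulation of the between-modality term, in which the expression embedding and image embedding of the same spot form a positive pair and all other spots in the batch serve as negatives. This is a minimal illustration, not the paper's implementation; the function name, temperature value, and embedding dimensions are assumptions introduced here for exposition.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(z_expr, z_img, temperature=0.1):
    """InfoNCE-style loss pulling together the gene-expression and image
    embeddings of the same spot and pushing apart those of different spots.

    z_expr, z_img: (n_spots, dim) projected embeddings of the two modalities.
    """
    z_expr = F.normalize(z_expr, dim=1)
    z_img = F.normalize(z_img, dim=1)
    # Pairwise cosine similarities scaled by the temperature.
    logits = z_expr @ z_img.t() / temperature
    # The matching spot (diagonal) is the positive for each row/column.
    targets = torch.arange(z_expr.size(0), device=z_expr.device)
    # Symmetric loss over the expression-to-image and image-to-expression directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings for 64 spots of dimension 128.
z_e, z_i = torch.randn(64, 128), torch.randn(64, 128)
loss = cross_modal_contrastive_loss(z_e, z_i)
```

The two within-modality losses described in the abstract could be written analogously, contrasting augmented views of the same modality rather than pairs across modalities.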