Feature representation is a crucial issue in multimodal image registration. Handcrafted features extracted by traditional methods are highly sensitive to nonlinear radiometric differences, while supervised learning methods are limited by the scarcity of labeled samples in the remote sensing field. Therefore, this paper proposes a consistency feature representation learning method for multimodal image registration, which maps data into a common feature space to enable accurate alignment of remote sensing images. First, a contrastive network with a spatial attention mechanism is designed to enhance the network's capability to highlight high-level image features. Second, a positive sample augmentation strategy is combined with a contrastive loss, which helps the model better learn the inherent features of the images and imposes similarity constraints on sample pairs to optimize the feature projection. Finally, a multimodal image registration framework is proposed to enhance the stability of feature matching. The proposed framework achieves accurate feature extraction and consistency feature description for multimodal images, ensuring robustness against nonlinear radiometric differences. Experimental results demonstrate that the proposed method obtains more reliable registration results on the SEN1-2 dataset. Furthermore, the algorithm achieves superior performance on data from other modalities, indicating strong generalization ability.
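To make the contrastive objective concrete, the sketch below shows an InfoNCE-style loss over paired multimodal embeddings, where corresponding patches from two modalities (e.g., SAR and optical) form positive pairs and all other pairs in the batch serve as negatives. This is a minimal illustration under assumed conventions: the paper's exact loss formulation, network architecture, and augmentation pipeline are not specified here, and the function name `info_nce_loss` and the temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style contrastive loss for paired multimodal embeddings.

    z_a, z_b: (N, D) embeddings of the same N scenes from two modalities;
    row i of z_a and row i of z_b form a positive pair, and every other
    row in the batch acts as a negative.
    """
    z_a = F.normalize(z_a, dim=1)          # project embeddings onto the unit sphere
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature   # (N, N) cosine-similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric cross-entropy: each embedding must identify its counterpart,
    # pulling positive pairs together and pushing negatives apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors in place of encoder outputs; in practice the
# positive samples would come from augmented views (e.g., flips, crops) of
# the same patch before encoding.
if __name__ == "__main__":
    torch.manual_seed(0)
    z_sar, z_opt = torch.randn(8, 128), torch.randn(8, 128)
    print(info_nce_loss(z_sar, z_opt).item())
```

Driving both modality encoders with a shared objective of this kind is one standard way to realize a common feature space, since the loss is minimized only when corresponding patches map to nearby points regardless of their radiometric appearance.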