Feature matching is a core step in feature-based registration of multi-source remote sensing images. Existing methods, whether the classical SIFT algorithm or deep learning-based approaches, essentially generate descriptors from the local regions around feature points, which can lead to low matching success rates under gray-scale changes, content changes, local similarity, and occlusions between images. Inspired by the way humans first find roughly corresponding regions globally and then carefully compare local regions, and by the global attention property of transformers, the proposed feature matching network adopts a coarse-to-fine matching strategy that uses both global and local information between images to predict corresponding feature points. Importantly, the network is flexible: it can predict correspondences for arbitrary feature points, and it can be trained effectively without strong supervision in the form of corresponding feature points, requiring only the true geometric transformation between images. Qualitative experiments illustrate the effectiveness of the proposed network by matching feature points extracted by SIFT or sampled uniformly. In the quantitative experiments, we used feature points extracted by SIFT, SuperPoint, and LoFTR as the keypoints to be matched, and calculated the mean match success ratio (MSR) and mean reprojection error (MRE) of each method at different thresholds on the test dataset. Boxplots were also plotted to visualize the distributions. Comparing the MSR and MRE values and their distributions with those of other methods shows that the proposed method consistently outperforms the comparison methods in terms of MSR at different thresholds, while its MRE remains within a reasonable range compared to the MRE of other methods.
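
The two evaluation metrics named above can be illustrated with a minimal sketch. The exact definitions are not given here, so the following are assumptions made only for illustration: the true geometric transformation is a 3x3 homography, a match counts as successful when its reprojection error is at or below the threshold, and the MRE is averaged over successful matches only. The function names `project` and `msr_and_mre` are hypothetical.

```python
import numpy as np


def project(points, H):
    """Map an (N, 2) array of points through a 3x3 homography H."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    proj = pts_h @ H.T
    return proj[:, :2] / proj[:, 2:3]  # back to Cartesian coords


def msr_and_mre(src_pts, pred_pts, H_true, threshold):
    """Return (match success ratio, mean reprojection error) for one image pair.

    Assumed definitions (not taken from the source):
      - error = Euclidean distance between the predicted point and the
        source point projected by the true transformation;
      - a match succeeds if error <= threshold;
      - MRE is the mean error over successful matches (NaN if there are none).
    """
    errors = np.linalg.norm(project(src_pts, H_true) - pred_pts, axis=1)
    success = errors <= threshold
    msr = float(success.mean())
    mre = float(errors[success].mean()) if success.any() else float("nan")
    return msr, mre


if __name__ == "__main__":
    # Toy data: the true transformation is a pure translation by (10, 5).
    H = np.array([[1.0, 0.0, 10.0],
                  [0.0, 1.0, 5.0],
                  [0.0, 0.0, 1.0]])
    src = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
    # Predicted matches with errors of 0, 1, 2 and 10 pixels.
    offsets = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
    pred = project(src, H) + offsets
    print(msr_and_mre(src, pred, H, threshold=3.0))
```

Averaging these per-pair values over all image pairs in a test set, at each threshold of interest, would give the mean MSR and mean MRE under the assumptions stated above.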