Using satellite images and UAV images to locate the same geographical target can provide new ideas for UAV positioning and navigation. However, images from these two remote sensing platforms, UAV and satellite, differ greatly in appearance, which makes this task highly challenging. Existing methods are usually limited to convolutional neural networks, which underexploit the global information in images, and they pay little attention to spatial information. To address these issues, this research proposes a method that extracts pose information using Structure-from-Motion (SfM), uses this pose to align the spatial features of the images, and introduces a vision transformer so that the network focuses on learning the common feature space shared across viewpoint sources. Extensive experiments were carried out on the large-scale benchmark dataset University-1652. The experimental results show that the proposed method outperforms the baseline and is competitive with other state-of-the-art methods.