Locating targets from images is a challenging task for unmanned aerial vehicles (UAVs) that lack a positioning system. Matching drone and satellite images is one of the key steps in this task. Because of the large angle and scale gap between drone and satellite views, it is essential to extract fine-grained features with strong representational power. Most published methods are based on CNN architectures, which lose a considerable amount of information owing to the limitations of the convolution operation (e.g., a limited receptive field and downsampling). To compensate for this shortcoming, a transformer-based network is proposed to extract richer contextual information. The network promotes feature alignment through a semantic guidance module (SGM). The SGM aligns corresponding semantic parts of the two images by classifying each pixel according to its attention. In addition, the proposed method can be easily combined with existing methods. The proposed method has been evaluated on the newest UAV-based geo-localization dataset. Compared with the existing state-of-the-art (SOTA) method, the proposed method improves accuracy by almost 8%.
INDEX TERMS Cross-view image matching, geo-localization, UAV image localization, deep neural network.
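To make the idea of attention-guided part alignment concrete, the following is a minimal sketch, not the authors' implementation: it assumes patch tokens from a transformer backbone and uses their class-token attention as a proxy for semantic importance, ranking the tokens and pooling them into a fixed number of parts. The function name, part count, and the use of class-token attention are all illustrative assumptions.

```python
# Hypothetical sketch of attention-based semantic part pooling (not the paper's code).
import torch

def semantic_guidance_pool(patch_tokens: torch.Tensor,
                           cls_attention: torch.Tensor,
                           num_parts: int = 3) -> torch.Tensor:
    """
    patch_tokens:  (B, N, C) patch embeddings from a transformer backbone.
    cls_attention: (B, N) attention of each patch w.r.t. the class token,
                   used here as a stand-in for semantic importance (assumption).
    Returns:       (B, num_parts, C) one pooled feature per semantic part.
    """
    B, N, C = patch_tokens.shape
    # Rank patches by attention heat; split the ranking into equal-sized groups
    # so patches of similar importance are assigned to the same semantic part.
    order = cls_attention.argsort(dim=1, descending=True)            # (B, N)
    bounds = torch.linspace(0, N, num_parts + 1).long()
    part_feats = []
    for k in range(num_parts):
        idx = order[:, bounds[k]:bounds[k + 1]]                      # (B, n_k)
        gathered = torch.gather(
            patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))    # (B, n_k, C)
        part_feats.append(gathered.mean(dim=1))                      # (B, C)
    return torch.stack(part_feats, dim=1)                            # (B, K, C)

# Usage: pool drone and satellite views into parts that can be matched part-by-part.
drone_parts = semantic_guidance_pool(torch.randn(2, 196, 768), torch.rand(2, 196))
sat_parts   = semantic_guidance_pool(torch.randn(2, 196, 768), torch.rand(2, 196))
```

Because the pooling operates on ranked token groups rather than fixed spatial bands, the same kind of part descriptors can be attached to other cross-view matching backbones, which is consistent with the abstract's claim that the module combines easily with existing methods.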