Street-to-aerial cross-view image geo-localization aims to determine the location of a query street-view image by retrieving the aerial-view image of the same place. The drastic viewpoint and appearance gap between aerial-view and street-view images poses a major challenge for this task. In this paper, we propose a novel multiscale attention encoder that captures the multiscale contextual information of aerial-view and street-view images. To bridge the domain gap between the two views, we first apply an inverse polar transform that approximately aligns the street-view images with the aerial-view images. The proposed multiscale attention encoder then converts each image into a feature representation guided by the learnt multiscale information. Finally, we propose a novel global mining strategy that enables the network to pay more attention to hard negative exemplars. Experiments on standard benchmark datasets show that our approach attains a top-1 recall rate of 81.39% on the CVUSA dataset and 71.52% on the CVACT dataset, achieving state-of-the-art performance and significantly outperforming most existing methods.
KEYWORDS: global mining strategy, image geo-localization, multiscale attention encoder, street-to-aerial cross-view

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
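
For readers unfamiliar with the retrieval metric quoted in the abstract, the following is a minimal sketch of how a top-k recall rate could be computed for cross-view retrieval. It is an illustration under stated assumptions, not the evaluation code used in this work: the function name `top_k_recall`, the use of cosine similarity, and the convention that the i-th street-view feature matches the i-th aerial-view feature are all assumptions made for the example.

```python
import numpy as np


def top_k_recall(street_feats, aerial_feats, k=1):
    """Fraction of street-view queries whose true aerial match ranks in the top k.

    Assumption: row i of street_feats and row i of aerial_feats describe the
    same location, and similarity is measured by cosine similarity.
    """
    q = np.asarray(street_feats, dtype=float)
    g = np.asarray(aerial_feats, dtype=float)

    # L2-normalise each feature so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)

    # sim[i, j] is the similarity between query i and aerial image j.
    sim = q @ g.T

    # Similarity of each query to its ground-truth aerial image.
    true_sim = np.diag(sim)

    # Rank of the true match = number of aerial images scored strictly higher.
    rank = (sim > true_sim[:, None]).sum(axis=1)

    return float((rank < k).mean())


# Toy example with three locations and 2-D features (hypothetical values).
aerial = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
street = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.0]]
print(top_k_recall(street, aerial, k=1))
print(top_k_recall(street, aerial, k=2))
```

In the toy example, the third query is closer to the first aerial feature than to its own match, so it counts as a miss at k=1 but as a hit at k=2.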