Ground-to-Aerial Image Geo-Localization With a Hard Exemplar Reweighting Triplet Loss

Cai, Sudong; Guo, Yulan; Khan, Salman; Hu, Jiwei; Wen, Gongjian

doi:10.1109/iccv.2019.00848

Cited by 113 publications

(87 citation statements)

References 26 publications


“…Effect of loss objectives. The triplet loss and contrastive loss are widely applied in previous works [5,6,16,35,43], and the weighted soft margin triplet loss is deployed in [4,12,18]. We evaluate these three losses on two tasks, i.e., Drone → Satellite and Satellite → Drone and compare three losses with the instance loss used in our baseline.…”

Section: Ablation Study and Further Discussion (mentioning)

confidence: 99%

University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization

Zheng

Wei

Yang

2020

Proceedings of the 28th ACM International Conference on Multimedia

179

134


Figure 1: It is challenging, even for a human, to associate (a) ground-view images with (b) satellite-view images. In this paper, we introduce a new dataset based on the third platform, i.e., drone, to provide real-life viewpoints and intend to bridge the visual gap against views. (c) Here we show two real drone-view images collected from public drone flights on Youtube [1, 8]. (d) In practice, we use the synthetic drone-view camera to simulate the real drone flight. It is based on two concerns. First, the collection expense of real drone flight is unaffordable. Second, the synthetic camera has a unique advantage in the manipulative viewpoint. Specifically, the 3D engine in Google Earth is utilized to simulate different viewpoints in the real drone camera.
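The citation statement above compares several loss objectives, including the weighted soft-margin triplet loss. As a rough illustration, that loss is commonly written as log(1 + exp(α(d_pos − d_neg))) for a single triplet. The sketch below assumes scalar distances as inputs; the function name and the default `alpha=10.0` are illustrative choices, not values taken from the cited papers.

```python
import math


def weighted_soft_margin_triplet_loss(d_pos, d_neg, alpha=10.0):
    """Loss for one triplet: log(1 + exp(alpha * (d_pos - d_neg))).

    d_pos: distance between the anchor and its matching (positive) image.
    d_neg: distance between the anchor and a non-matching (negative) image.
    alpha: scaling factor (assumed default, chosen only for illustration).
    """
    x = alpha * (d_pos - d_neg)
    # Numerically stable softplus: max(x, 0) + log(1 + exp(-|x|)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# A well-separated triplet yields a small loss, a violated one a large loss.
easy = weighted_soft_margin_triplet_loss(d_pos=0.2, d_neg=1.0)
hard = weighted_soft_margin_triplet_loss(d_pos=1.0, d_neg=0.2)
print(easy < hard)
```

A larger `alpha` sharpens the penalty around the point where the positive and negative distances are equal.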


“…To steer where to focus in images, the attention mechanism is introduced to the field of cross-view geolocalization. Cai et al. [2] introduce a lightweight attention module that combines spatial and channel attention mechanisms to emphasize visually salient features. They also propose a novel reweighting loss that adaptively allocates weights to triplets according to their difficulties, thus improving the quality of network training.…”

Section: Related Work (mentioning)

confidence: 99%

“…Towards the above goal, several recent works incorporate convolutional neural networks (CNNs) with NetVLAD layers [8], capsule networks [20] or attention mechanisms [2,16] to learn visually discriminative representations. However, the locality assumption of their CNN architectures hinders their performance in complex scenarios, where visual interferences such as obstacles and transient objects (e.g., cars and pedestrians) may exist.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-view Geo-localization with Evolving Transformer

Yang

Lu

Zhu

2021

Preprint


In this work, we address the problem of cross-view geo-localization, which estimates the geospatial location of a street view image by matching it with a database of geo-tagged aerial images. The cross-view matching task is extremely challenging due to drastic appearance and geometry differences across views. Unlike existing methods that predominantly fall back on CNN, here we devise a novel evolving geo-localization Transformer (EgoTR) that utilizes the properties of self-attention in Transformer to model global dependencies, thus significantly decreasing visual ambiguities in cross-view geo-localization. We also exploit the positional encoding of Transformer to help the EgoTR understand and correspond geometric configurations between ground and aerial images. Compared to state-of-the-art methods that impose strong assumptions on geometry knowledge, the EgoTR flexibly learns the positional embeddings through the training objective and hence becomes more practical in many real-world scenarios. Although Transformer is well suited to our task, its vanilla self-attention mechanism independently interacts within image patches in each layer, which overlooks correlations between layers. Instead, this paper proposes a simple yet effective self-cross attention mechanism to improve the quality of learned representations. The self-cross attention models global dependencies between adjacent layers, which relates between image patches while modeling how features evolve in the previous layer. As a result, the proposed self-cross attention leads to more stable training, improves the generalization ability and encourages representations to keep evolving as the network goes deeper. Extensive experiments demonstrate that our EgoTR performs favorably against state-of-the-art methods on standard, fine-grained and cross-dataset cross-view geo-localization tasks. Codes are available in the supplementary material.
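The first citation statement in this entry describes a loss that assigns weights to triplets according to their difficulty. The sketch below illustrates that general idea only. The weighting function (a sigmoid of the distance gap), the names, and the defaults `alpha` and `beta` are all assumptions made for this example, and it should not be read as the formula from Cai et al.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    """Stable logistic function via tanh."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def reweighted_triplet_loss(triplets, alpha=10.0, beta=5.0):
    """Mean soft-margin triplet loss with a hypothetical difficulty weight.

    triplets: list of (d_pos, d_neg) distance pairs.
    Assumed weight: sigmoid(beta * (d_pos - d_neg)), so triplets whose
    negative is closer than the positive (hard ones) get weights near 1,
    and well-separated (easy) ones get weights near 0.
    """
    total = 0.0
    for d_pos, d_neg in triplets:
        gap = d_pos - d_neg
        total += sigmoid(beta * gap) * softplus(alpha * gap)
    return total / len(triplets)


batch = [(0.2, 1.0), (0.9, 0.8), (1.0, 0.2)]  # easy, borderline, hard
print(reweighted_triplet_loss(batch))
```

Under this assumed scheme, easy triplets contribute almost nothing to the batch loss, so the gradient is dominated by the hard ones.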


“…[19] showed that orientation information, in the form of hand-crafted UV maps, helps to convey the approximate viewpoint difference to the network during training. Recently, [5] applied both spatial and channel-wise attention to the feature maps and trained them with a hard exemplar reweighting triplet loss.…”

Section: Domain-invariant Features a Central Question In Cross- (mentioning)

confidence: 99%

“…Satellite imagery, on the other hand, is broadly available for most parts of the world with services like Google Maps. This encouraged researchers to focus on cross-view image-based geo-localization [40,18,36,19,31,5,33,32] as a more general and inclusive alternative. The overall idea is to predict the latitude and longitude of a street-level image by matching it against a GPS-tagged satellite database.…”

Section: Introduction (mentioning)

confidence: 99%
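The last statement above summarizes the overall idea as matching a street-level image against a GPS-tagged satellite database. A minimal sketch of that retrieval step follows, assuming both views have already been mapped to embedding vectors by some network. The function names, the toy embeddings, and the coordinates are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def localize(query_embedding, database):
    """Return the (lat, lon) tag of the most similar satellite embedding.

    database: list of (embedding, (lat, lon)) pairs (hypothetical layout).
    """
    best = max(database, key=lambda entry: cosine_similarity(query_embedding, entry[0]))
    return best[1]


# Toy database with invented embeddings and coordinates.
satellite_db = [
    ([1.0, 0.0, 0.0], (40.0, -75.0)),
    ([0.0, 1.0, 0.0], (41.0, -74.0)),
]
print(localize([0.1, 0.9, 0.0], satellite_db))  # prints (41.0, -74.0)
```

In practice the database would hold many thousands of entries, and the linear scan would usually be replaced by an approximate nearest-neighbor index.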