2023
DOI: 10.1609/aaai.v37i3.25457

Cross-View Geo-Localization via Learning Disentangled Geometric Layout Correspondence

Abstract: Cross-view geo-localization aims to estimate the location of a query ground image by matching it against a reference database of geo-tagged aerial images. The task is extremely challenging: its difficulties are rooted in the drastic view changes and the different capture times of the two views. Despite these difficulties, recent works have achieved outstanding progress on cross-view geo-localization benchmarks. However, existing methods still suffer from poor performance on the cross-area benchmarks, in which the training and testin…

Cited by 26 publications (5 citation statements)
References 37 publications (170 reference statements)
“…Tian et al [32] used semantic segmentation and generative adversarial networks to process satellite images and ViT to retrieve images. Zhang et al [33] combined CNN and Transformer to extract the spatial configuration of visual feature layouts. Shi et al [34] applied regular polar transform to reduce the differences between ground and satellite images.…”
Section: Image Retrieval Methods For Cross-view Geo-localization (mentioning)
confidence: 99%
“…Rashkovetsky et al combined multiple U-Nets [11] to detect and segment wildfires from multi-band satellite images [6]. Zhang et al proposed to identify the location of ground images by comparing the learned features from geo-tagged satellite images [25,26]. Li et al studied poverty mapping from satellite images by proposing a point-to-region dynamic learning framework that can help people find clean water and sanitation services in low-income countries [27].…”
Section: Satellite Imagery Analysis (mentioning)
confidence: 99%
“…Different from some of the methods mentioned above, for example, refs. [25,27], which only use color channel information, considering the nature of permeable surface mapping, we also take advantage of the infrared band. Our extensive experiments and ablation studies in Section 4 demonstrate the effectiveness of the infrared band in our proposed model.…”
Section: Satellite Imagery Analysis (mentioning)
confidence: 99%
“…L2LTR [16] employs a transformer encoder as a backbone, utilizing self and cross attention mechanisms to emulate global dependency relationships between adjacent layers, thereby enhancing the quality of the learned representations. GeoDTR [17] utilizes a transformer encoder to separate geometric information from the original features and, through a novel geometric contextual extraction module, it can learn the spatial correlations between visual features in satellite and ground images. TransGeo [18] fully leverages the advantages of transformer encoder global information modeling and explicit positional encoding, reducing computational costs and enhancing performance.…”
Section: B Transformer In Geo-localization (mentioning)
confidence: 99%
“…The remarkable contextual modeling capability of the transformer compensates for the limitations of CNNs. At present, transformer-based cross-view geo-localization technology mainly utilizes transformer encoders as the backbone of feature extraction, improving the ability of contextual feature extraction [15], [16], [17], [18]. Some methods use ViT [19] as the backbone for extracting context-sensitive information [20], [21] to better adapt to image data.…”
Section: Introduction (mentioning)
confidence: 99%