Cross-view Geo-localization with Evolving Transformer

Yang, Hongji; Lu, Xiufan; Zhu, Yingying

doi:10.48550/arxiv.2107.00842

Cited by 3 publications

(3 citation statements)

References 24 publications

“…In 2020, Dosovitskiy et al [9] creatively applied Transformer to the image field, divided continuous images into discrete Tokens through Patch Embedding, and converted the input into a sequence. In 2021, EgoTR [10] first introduced the Transformer model into the cross view localization task, enabling global context learning and location aware representation. At the same time, it also proposed a novel self-attention to promote cross layer information flow and achieve the transcendence of the CNN model.…”

Section: Transformer-based Methods (mentioning)

confidence: 99%

“…The inference speed is 1.48 times that of DSM and 2.52 times that of EgoTR, reaching the real-time speed of 33.78 FPS. [Table: Param (MB), FPS (frames per second): SAFA [6]: 29.5, 35.90; DSM [8]: 17.9, 22.72; EgoTR [10]: 195.9, …] … 3, the offset module has N branches with identical structures but not shared parameters. Different branches are responsible for learning the features of different offsets.…”

Section: Comparison Of Model Size and Inference Speed (mentioning)

confidence: 99%

Multi-branch offset architecture for unaligned cross-view geo localization

Li, Su, Wang

2023

International Conference on Computer Vision, Application, and Algorithm (CVAA 2022)

Cross-view geo-localization usually requires matching a ground-view image with the corresponding aerial-view image. This task has attracted researchers' attention in recent years because of the large difference in perspective and the occlusion of information between the two views. Previous work has focused mostly on aligned conditions, where the aerial and ground views are both aligned to north. The explicit alignment information is beneficial to model learning. In this work, however, we extend the task to unaligned conditions and expect the model to perform well even with random offsets in the ground-view image. To solve this problem, we propose a new multi-branch offset (MBO) architecture. Specifically, we divide the feature extracted by the CNN into N parts and shift the feature along the horizontal direction. Features of different offset sizes are then fed into subsequent multi-branch Transformer blocks, and the relationships between channels are further expressed using channel attention. Finally, the enhanced features are fused with the output of the original CNN. We validated our method on a popular dataset, CVUSA, where it reaches a Top-1 recall of 78.13% at 33.78 FPS, which is higher than existing models while keeping inference efficient.
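The horizontal offset step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the shift is circular (a panorama wraps around horizontally) and that the N offsets are evenly spaced across the feature width, neither of which the abstract states. The function names `horizontal_offsets`, `roll_columns`, and `branch_inputs` are hypothetical.

```python
def horizontal_offsets(width, n_branches):
    """Assumed scheme: one column offset per branch, evenly spaced over the width."""
    return [(k * width) // n_branches for k in range(n_branches)]


def roll_columns(feature, offset):
    """Circularly shift every row of a 2D feature map (list of rows) right by `offset`."""
    shifted = []
    for row in feature:
        k = offset % len(row)
        # Columns that fall off the right edge re-enter on the left.
        shifted.append(row[-k:] + row[:-k] if k else list(row))
    return shifted


def branch_inputs(feature, n_branches):
    """Build one shifted copy of the feature map per branch (hypothetical helper)."""
    width = len(feature[0])
    return [roll_columns(feature, off) for off in horizontal_offsets(width, n_branches)]


if __name__ == "__main__":
    feature = [[0, 1, 2, 3], [4, 5, 6, 7]]
    for branch in branch_inputs(feature, 2):
        print(branch)
```

In a full model each shifted copy would feed its own Transformer branch; the sketch covers only the shifting, since the abstract gives no further detail on the branches, channel attention, or fusion.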

“…Tulder et al [57] presented a novel cross-view transformer method to transfer information between unregistered views at the level of spatial feature maps, which achieved remarkable results in field of Multi-view medical image analysis. Yang et al [58] proposed a simple yet effective self-cross attention mechanism to improve the quality of learned representations. Which improved the generalization ability and encourages representations to keep evolving as the network goes deeper.…”

Section: B Transformer In Vision (mentioning)

confidence: 99%

A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization

Dai, Hu, Zhuang et al.

2022

Preprint

Cross-view geo-localization is the task of matching images of the same geographic target taken from different views, e.g., unmanned aerial vehicle (UAV) and satellite. The most difficult challenges are position shift and the uncertainty of distance and scale. Existing methods mainly aim to mine more comprehensive fine-grained information, but they underestimate the importance of extracting robust feature representations and the impact of feature alignment. CNN-based methods have achieved great success in cross-view geo-localization, but they still have limitations: they can only extract part of the information in a neighborhood, and some scale-reduction operations lose fine-grained information. To address this, we introduce a simple and efficient transformer-based structure called Feature Segmentation and Region Alignment (FSRA) to enhance the model's ability to understand contextual information as well as the distribution of instances. Without using additional supervisory information, FSRA divides regions based on the heat distribution of the transformer's feature map and then aligns multiple specific regions in different views one to one. Finally, FSRA integrates each region into a set of feature representations. Notably, FSRA does not divide regions manually but automatically, based on the heat distribution of the feature map, so that specific instances can still be divided and aligned when there are significant shifts and scale changes in the image. In addition, a multiple sampling strategy is proposed to overcome the disparity between the number of satellite images and that of images from other sources. Experiments show that the proposed method has superior performance and achieves state-of-the-art results in both drone-view target localization and drone navigation. Code will be released at https://github.com/Dmmm1997/FSRA
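The heat-based region division described in the abstract might look roughly like the sketch below. It rests on assumptions the abstract does not confirm: that the "heat" of a patch is the mean of its feature vector, that patches are ranked by heat and split into equally sized groups, and that each region is summarized by average pooling. The names `segment_by_heat` and `pool_regions` are hypothetical, and the sketch assumes at least as many patches as regions.

```python
def segment_by_heat(tokens, n_regions):
    """Group patch indices into n_regions by descending heat (assumed: mean activation)."""
    heat = [sum(t) / len(t) for t in tokens]
    order = sorted(range(len(tokens)), key=lambda i: heat[i], reverse=True)
    size, extra = divmod(len(order), n_regions)
    regions, start = [], 0
    for r in range(n_regions):
        # The first `extra` regions take one additional patch when the split is uneven.
        end = start + size + (1 if r < extra else 0)
        regions.append(order[start:end])
        start = end
    return regions


def pool_regions(tokens, regions):
    """Average the feature vectors of the patches in each region."""
    dim = len(tokens[0])
    return [
        [sum(tokens[i][d] for i in idx) / len(idx) for d in range(dim)]
        for idx in regions
    ]


if __name__ == "__main__":
    tokens = [[1.0, 1.0], [5.0, 5.0], [3.0, 3.0], [0.0, 2.0]]
    regions = segment_by_heat(tokens, 2)
    print(regions)
    print(pool_regions(tokens, regions))
```

Because regions are defined by heat rank and not by fixed image coordinates, the same instance can land in the same region in both views even when it shifts position, which is the property the abstract emphasizes.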
