“…Especially for cross-modality localization, the identification of reliable landmarks for various sensor and map modalities is non-trivial. Here, learning-based approaches have become the state of the art for both cross-modal PR [1], [11], [34], [35], [36] and local pose tracking, achieving localization accuracies below 1 m for various sensor modalities, including radar-to-lidar [2], [37], [38], range-to-aerial-imagery [35], [39], [40], [41], and camera-to-aerial-imagery, also called cross-view geolocalization (CVGL) [27], [41], [42].…”