“…The 2D-3D matches are then used for camera pose estimation, e.g., by applying a PnP solver [1,31,41,43,45,46] inside a robust estimator such as RANSAC [4,9,17,25,47,66]. These visual localization methods typically use either local image descriptors [19,22,34,52] to explicitly match 2D features to 3D scene points or use machine learning, e.g., via a random forest [15,16] or a convolutional neural network (CNN) [6,7,14], to regress the corresponding 3D scene coordinate per pixel. They build a scene representation, e.g., a 3D Structure-from-Motion (SfM) model for local features or a CNN for scene coordinate regression, from a set of reference images.…”