Real-Time RGB-D Camera Pose Estimation in Novel Scenes Using a Relocalisation Cascade

Cavallari, Tommaso; Golodetz, Stuart; Lord, Nicholas A.; Valentin, Julien; Prisacariu, Victor Adrian; Stefano, Luigi Di; Torr, Philip H. S.

doi:10.1109/tpami.2019.2915068

Cited by 75 publications

(87 citation statements)

References 72 publications

(232 reference statements)


“…At online training time (purple and red boxes), we fill the reservoirs with points from the target scene, which we cluster using Really Quick Shift [14]. At test time (purple and blue boxes), we predict a reservoir for each pixel, and use the point clusters the reservoirs contain to generate correspondences that can be passed to a Kabsch-RANSAC camera pose estimation backend [12] to relocalise the camera: see §2.4.…”

Section: Back-project Points

mentioning

confidence: 99%
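The statement above passes point correspondences to a Kabsch-RANSAC pose estimation backend. As a rough illustration only, the sketch below shows the generic textbook form of that idea in NumPy: Kabsch fits a rigid transform to 3D-3D correspondences, and RANSAC repeats the fit on random minimal samples to reject outliers. It is not the authors' implementation; the function names, the iteration count and the inlier threshold are all assumptions chosen for the example.

```python
import numpy as np

def kabsch(cam_pts, world_pts):
    """Least-squares rigid transform (R, t) with world ~= R @ cam + t."""
    cam_c = cam_pts.mean(axis=0)
    world_c = world_pts.mean(axis=0)
    # 3x3 cross-covariance of the centred point sets.
    H = (cam_pts - cam_c).T @ (world_pts - world_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = world_c - R @ cam_c
    return R, t

def kabsch_ransac(cam_pts, world_pts, iters=200, thresh=0.05, seed=0):
    """Hypothetical RANSAC wrapper: iters and thresh are illustrative values."""
    rng = np.random.default_rng(seed)
    best_inliers, best_count = None, -1
    for _ in range(iters):
        # Three correspondences are the minimal sample for a rigid transform.
        idx = rng.choice(len(cam_pts), size=3, replace=False)
        R, t = kabsch(cam_pts[idx], world_pts[idx])
        err = np.linalg.norm(cam_pts @ R.T + t - world_pts, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_count:
            best_count, best_inliers = inliers.sum(), inliers
    # Refit on all inliers of the best hypothesis.
    R, t = kabsch(cam_pts[best_inliers], world_pts[best_inliers])
    return R, t, best_inliers
```

Here `cam_pts` would be back-projected camera-space points and `world_pts` the scene coordinates predicted for the same pixels, both as `(N, 3)` arrays. Real systems typically score and refine hypotheses in more elaborate ways than this plain inlier count.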

“…One online local regression approach is that of [13,12], which showed how to adapt the regression forests of [63] for online use in real time. Their approach achieves state-of-the-art performance on the popular 7-Scenes [63] and Stanford 4 Scenes [68] indoor datasets, and also performs well on some of the easier outdoor scenes from Cambridge Landmarks [36,34,35].…”

Section: Back-project Points

mentioning

confidence: 99%

“…Their approach achieves state-of-the-art performance on the popular 7-Scenes [63] and Stanford 4 Scenes [68] indoor datasets, and also performs well on some of the easier outdoor scenes from Cambridge Landmarks [36,34,35]. However, because their forests use hand-crafted features that were designed for indoor use [63], they struggle [12] to work out-of-the-box on harder outdoor scenes. Whilst it might in principle be possible to solve this problem by hand-crafting new features for outdoor use, doing so could be time-consuming and costly.…”

Section: Back-project Points

mentioning

confidence: 99%

“…Indeed, the broader trend in machine learning has been towards replacing models such as regression forests with neural networks that can learn suitable features, rather than trying to hand-craft them manually. However, replacing the forests used by [13,12] with networks is not straightforward. To achieve online relocalisation, they rely on the way in which their forests predict leaves containing reservoirs of points to adapt forests between scenes, and it is tricky to see how this scheme can be easily transferred to work with local regression networks, which tend to directly predict individual points in the training scene.…”

Section: Back-project Points

mentioning

confidence: 99%


Let's Take This Online: Adapting Scene Coordinate Regression Network Predictions for Online RGB-D Camera Relocalisation

Cavallari

Bertinetto

Mukhoti

et al. 2019

2019 International Conference on 3D Vision (3DV)

Self Cite


Many applications require a camera to be relocalised online, without expensive offline training on the target scene. Whilst both keyframe and sparse keypoint matching methods can be used online, the former often fail away from the training trajectory, and the latter can struggle in textureless regions. By contrast, scene coordinate regression (SCoRe) methods generalise to novel poses and can leverage dense correspondences to improve robustness, and recent work has shown how to adapt SCoRe forests between scenes, allowing their state-of-the-art performance to be leveraged online. However, because they use features hand-crafted for indoor use, they do not generalise well to harder outdoor scenes. Whilst replacing the forest with a neural network and learning suitable features for outdoor use is possible, the techniques used to adapt forests between scenes are unfortunately harder to transfer to a network context. In this paper, we address this by proposing a novel way of leveraging a network trained on one scene to predict points in another scene. Our approach replaces the appearance clustering performed by the branching structure of a regression forest with a two-step process that first uses the network to predict points in the original scene, and then uses these predicted points to look up clusters of points from the new scene. We show experimentally that our online approach achieves state-of-the-art performance on both the 7-Scenes and Cambridge Landmarks datasets, whilst running in under 300ms, making it highly effective in live scenarios.


“…Index Terms: Heterogeneous, FPGA, real-time, stereo, depth. Obtaining information about the 3D structure of a scene is important for many computer vision and robotics applications, e.g. 3D scene reconstruction [1]-[3], camera relocalisation [4]-[6], navigation and obstacle avoidance [7]. Often, this information will be obtained in the form of a depth image, and various options for acquiring such images exist.…”

mentioning

confidence: 99%

Real-Time Highly Accurate Dense Depth on a Power Budget Using an FPGA-CPU Hybrid SoC

Rahnama

Cavallari

Golodetz

et al. 2019

IEEE Trans. Circuits Syst. II

Self Cite


Obtaining highly accurate depth from stereo images in real time has many applications across computer vision and robotics, but in some contexts, upper bounds on power consumption constrain the feasible hardware to embedded platforms such as FPGAs. Whilst various stereo algorithms have been deployed on these platforms, usually cut down to better match the embedded architecture, certain key parts of the more advanced algorithms, e.g. those that rely on unpredictable access to memory or are highly iterative in nature, are difficult to deploy efficiently on FPGAs, and thus the depth quality that can be achieved is limited. In this paper, we leverage an FPGA-CPU chip to propose a novel, sophisticated, stereo approach that combines the best features of SGM and ELAS-based methods to compute highly accurate dense depth in real time. Our approach achieves an 8.7% error rate on the challenging KITTI 2015 dataset at over 50 FPS, with a power consumption of only 5W.

Correspondence: {oscar@robots.ox.ac.uk}. O. Rahnama is with the University of Oxford and FiveAI Ltd. T. Joy and P. Torr are with the University of Oxford. A. Tonioni and L. Di Stefano are with the University of Bologna. T. Cavallari, S. Golodetz and S. Walker are with FiveAI Ltd. Work done whilst A. Tonioni was visiting the University of Oxford.

Obtaining information about the 3D structure of a scene is important for many computer vision and robotics applications, e.g. 3D scene reconstruction [1]-[3], camera relocalisation [4]-[6], navigation and obstacle avoidance [7]. Often, this information will be obtained in the form of a depth image, and various options for acquiring such images exist. Passive approaches, which rely only on one or more image sensors, are popular due to their low cost, low weight and size, lack of active/moving components, ability to work at longer ranges, deployability in a wider range of operating environments and lack of interference. Among them, binocular stereo relies on a pair of synchronised cameras to acquire the same scene from two different points of view. Given the two frames, a dense and reliable depth map can be computed by finding correspondences between the pixels in the two images [8].

State-of-the-art algorithms for this problem usually rely on costly global image optimisations or on massive convolutional neural networks that involve significant computational costs, making them hard to deploy on resource-limited systems such as embedded devices [9].

Two popular solutions offering a good trade-off between speed and accuracy are Semi-Global Matching (SGM) [10] and ELAS [11]. SGM computes initial matching hypotheses by comparing patches around pixels in the left and right images, then approximates a costly image-wide smoothness constraint with the sum of several directional minimizations over the disparity range. By contrast, ELAS first identifies a set of sparse but reliable correspondences to provide a coarse approximation of the scene geometry, then uses them to define slanted plane priors that guide the final dense matching stage. ...
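To make the "directional minimizations" in the SGM description concrete, the sketch below implements the standard SGM path-cost recurrence for a single direction (left to right) over a precomputed matching cost volume. It is a generic NumPy illustration, not the FPGA design described in the paper; the function name and the penalty values `p1` (one-level disparity change) and `p2` (larger disparity jump) are assumptions made for the example.

```python
import numpy as np

def aggregate_left_to_right(cost, p1=1.0, p2=8.0):
    """Aggregate a (H, W, D) matching cost volume along one scan direction.

    For each pixel p and disparity d, the path cost is
        L(p, d) = C(p, d) + min(L(q, d),
                                L(q, d - 1) + p1,
                                L(q, d + 1) + p1,
                                min_k L(q, k) + p2) - min_k L(q, k),
    where q is the previous pixel on the path (here, the left neighbour).
    """
    h, w, _ = cost.shape
    L = np.empty_like(cost, dtype=np.float64)
    L[:, 0, :] = cost[:, 0, :]          # no predecessor in the first column
    inf_col = np.full((h, 1), np.inf)   # padding for out-of-range disparities
    for x in range(1, w):
        prev = L[:, x - 1, :]                          # (H, D)
        prev_min = prev.min(axis=1, keepdims=True)     # (H, 1)
        from_below = np.concatenate([inf_col, prev[:, :-1]], axis=1) + p1
        from_above = np.concatenate([prev[:, 1:], inf_col], axis=1) + p1
        jump = prev_min + p2                           # broadcasts over D
        best = np.minimum(np.minimum(prev, from_below),
                          np.minimum(from_above, jump))
        # Subtracting prev_min keeps the path costs from growing unboundedly.
        L[:, x, :] = cost[:, x, :] + best - prev_min
    return L
```

A complete SGM pipeline would run this recurrence along several directions, sum the resulting volumes, and take the per-pixel `argmin` over the disparity axis to obtain the disparity map.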


RGB-D Object Classification for Autonomous Driving Perception

Premebida

Melotti

Asvadi

2019

RGB-D Image Analysis and Processing
