Learning to Solve Nonlinear Least Squares for Monocular Stereo

Clark, Ronald; Bloesch, Michael; Czarnowski, Jan; Leutenegger, Stefan; Davison, Andrew J.

doi:10.1007/978-3-030-01237-3_18

Cited by 62 publications

(47 citation statements)

References 23 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Finally, LS-Net [36] was a recent approach to learningbased monocular multi-view stereo and egomotion estimation. While VIOLearner [6], [37], which the present work extends, was the first approach to use a learned optimizer to minimize photometric loss for egomotion estimation, [36] similarly leveraged Jacobians and optimized both for egomotion as well as depth. However, their approach is supervised and required ground truth.…”

Section: Learning-based Methodsmentioning

confidence: 99%

Unsupervised Deep Visual-Inertial Odometry with Online Error Correction for RGB-D Imagery

Shamwell

Lindgren

Leung

et al. 2020

IEEE Trans. Pattern Anal. Mach. Intell.

View full text Add to dashboard Cite

While numerous deep approaches to the problem of vision-aided localization have been recently proposed, systems operating in the real world will undoubtedly experience novel sensory states previously unseen even under the most prodigious training regimens. We address the localization problem with online error correction (OEC) modules that are trained to correct a vision-aided localization network's mistakes. We demonstrate the generalizability of the OEC modules and describe our unsupervised deep neural network approach to the fusion of RGB-D imagery with inertial measurements for absolute trajectory estimation. Our network, dubbed the Visual-Inertial-Odometry Learner (VIOLearner), learns to perform visual-inertial odometry (VIO) without inertial measurement unit (IMU) intrinsic parameters or the extrinsic calibration between an IMU and camera. The network learns to integrate IMU measurements and generate hypothesis trajectories which are then corrected online according to the Jacobians of scaled image projection errors with respect to spatial grids of pixel coordinates. We evaluate our network against state-of-the-art (SoA) VIO, visual odometry (VO), and visual simultaneous localization and mapping (VSLAM) approaches on the KITTI Odometry dataset as well as a micro aerial vehicle (MAV) dataset that we collected in the AirSim simulation environment. We demonstrate better than SoA translational localization performance against comparable SoA approaches on our evaluation sequences.

show abstract

Section: Learning-based Methodsmentioning

confidence: 99%

Unsupervised Deep Visual-Inertial Odometry with Online Error Correction for RGB-D Imagery

Shamwell

Lindgren

Leung

et al. 2020

IEEE Trans. Pattern Anal. Mach. Intell.

View full text Add to dashboard Cite

show abstract

“…A comparison of this approach with our method when using only the two-view feature network (A) shows the benefits of our two-view feature encoder. • IC-FC-LS-Net: We also implemented LS-Net [15] within our IC framework with the following differences to the original paper. First, we do not estimate or refine depth.…”

Section: Direct Pose Regressionmentioning

confidence: 99%

Taking a Deeper Look at the Inverse Compositional Algorithm

Dellaert

Rehg

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

View full text Add to dashboard Cite

In this paper, we provide a modern synthesis of the classic inverse compositional algorithm for dense image alignment. We first discuss the assumptions made by this wellestablished technique, and subsequently propose to relax these assumptions by incorporating data-driven priors into this model. More specifically, we unroll a robust version of the inverse compositional algorithm and replace multiple components of this algorithm using more expressive models whose parameters we train in an end-to-end fashion from data. Our experiments on several challenging 3D rigid motion estimation tasks demonstrate the advantages of combining optimization with learning-based techniques, outperforming the classic inverse compositional algorithm as well as data-driven image-to-pose regression approaches.1 The warping function W ξ : R 2 → R 2 might represent translation, affine 2D motion or (if depth is available) rigid or non-rigid 3D motion. To avoid clutter in the notation, we do not make W ξ explicit in our equations.

show abstract

“…This volumetric fusion approach, popularised by KinectFusion [27], works by first tracking the camera pose and then it uses the integration approach of Curless and Levoy [9] to fuse the depth images into the volume. Various improvements have been introduced, mainly focused on reducing tracking drift [7] and increasing the size of scenes that can be reconstructed. Kintinuous [41], for example, uses a sliding volume to map large spaces.…”

Section: Related Workmentioning

confidence: 99%

X-Section: Cross-Section Prediction for Enhanced RGB-D Fusion

Nicastro

Clark

Leutenegger

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

View full text Add to dashboard Cite

Figure 1: Our approach uses predictions of the objects cross-sectional thickness to improve volumetric reconstruction quality. Top row shows the input to the proposed pipeline, an RGB-D frame. Bottom, cross-section prediction. From left to right in the middle, incremental reconstruction via our enhanced TSDF fusion algorithm. AbstractDetailed 3D reconstruction is an important challenge with application to robotics, augmented and virtual reality, which has seen impressive progress throughout the past years. Advancements were driven by the availability of depth cameras (RGB-D), as well as increased compute power, e.g. in the form of GPUs -but also thanks to inclusion of machine learning in the process. Here, we propose X-Section, an RGB-D 3D reconstruction approach that leverages deep learning to make object-level predictions about thicknesses that can be readily integrated into a volumetric multi-view fusion process, where we propose an extension to the popular KinectFusion approach. In essence, our method allows to complete shapes in general indoor scenes behind what is sensed by the RGB-D camera, which may be crucial e.g. for robotic manipulation tasks or efficient scene exploration. Predicting object thicknesses rather than volumes allows us to work with comparably high spatial resolution without exploding memory and training data requirements on the employed Convolutional Neural Networks. In a series of qualitative and quantitative evaluations, we demonstrate how we accurately predict object thickness and reconstruct general 3D scenes containing multiple objects.

show abstract

Learning to Solve Nonlinear Least Squares for Monocular Stereo

Cited by 62 publications

References 23 publications

Unsupervised Deep Visual-Inertial Odometry with Online Error Correction for RGB-D Imagery

Unsupervised Deep Visual-Inertial Odometry with Online Error Correction for RGB-D Imagery

Taking a Deeper Look at the Inverse Compositional Algorithm

X-Section: Cross-Section Prediction for Enhanced RGB-D Fusion

Contact Info

Product

Resources

About