2020
DOI: 10.1007/978-3-030-58571-6_25
Atlas: End-to-End 3D Scene Reconstruction from Posed Images

Abstract: We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images. Traditional approaches to 3D reconstruction rely on an intermediate representation of depth maps prior to estimating a full 3D model of a scene. We hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts features from each image independently, which are then back-projected and accumulated into a voxel volume using the camera intr…
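
The back-projection step the abstract describes is easy to sketch. Below is a minimal NumPy illustration of projecting voxel centers into an image and gathering per-voxel features; this is not the authors' code — every function and parameter name here is an assumption, and nearest-neighbor sampling stands in for whatever interpolation the paper actually uses.

```python
import numpy as np

def backproject_features(feat2d, K, T_cw, voxel_origin, voxel_size, grid_dim):
    """Gather a 2D feature map into a voxel grid by projecting each voxel
    center into the image (a sketch of the back-projection step described
    in the abstract; names and shapes are assumptions).

    feat2d:       (C, H, W) feature map from the 2D CNN
    K:            (3, 3) camera intrinsics
    T_cw:         (4, 4) world-to-camera extrinsics
    voxel_origin: (3,) world coordinates of voxel (0, 0, 0)
    voxel_size:   edge length of one voxel in meters
    grid_dim:     (nx, ny, nz) voxel grid dimensions
    """
    C, H, W = feat2d.shape
    nx, ny, nz = grid_dim

    # World coordinates of every voxel center.
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                             indexing="ij")
    xyz_w = voxel_origin + (np.stack([ii, jj, kk], -1) + 0.5) * voxel_size
    xyz_w = xyz_w.reshape(-1, 3)                                  # (N, 3)

    # Transform into the camera frame and project with the pinhole model.
    xyz_c = (T_cw[:3, :3] @ xyz_w.T + T_cw[:3, 3:4]).T            # (N, 3)
    z = xyz_c[:, 2]
    uvw = (K @ xyz_c.T).T                                         # (N, 3)
    u = uvw[:, 0] / np.clip(z, 1e-6, None)
    v = uvw[:, 1] / np.clip(z, 1e-6, None)

    # Keep voxels that land inside the image and in front of the camera.
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    vol = np.zeros((C, nx * ny * nz), dtype=feat2d.dtype)
    ui = u[valid].astype(int)
    vi = v[valid].astype(int)
    vol[:, valid] = feat2d[:, vi, ui]   # nearest-neighbor sampling for brevity
    return vol.reshape(C, nx, ny, nz), valid.reshape(nx, ny, nz)
```

Accumulating features from multiple views then amounts to summing these per-view volumes and dividing by a per-voxel visibility count, as in the averaging sketch further below.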

Cited by 190 publications (124 citation statements)
References 57 publications
“…With the success of deep learning, a number of learning-based techniques have been proposed to tackle the problem. While several methods learn to directly predict 3D geometry as grids [25,24], point clouds [7], and TSDFs [35], per-view depth-map estimation is still the top choice of most approaches [47,51,23,32,21,29,33,10,36] due to its robustness and flexibility. Most of those methods follow the spirit of conventional approaches [14,6] and train a cost-volume-based neural network.…”
Section: Related Work
confidence: 99%
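
The cost-volume construction this excerpt alludes to can be sketched in a few lines. Below is a simplified single-source plane sweep in NumPy; the dot-product matching score, nearest-neighbor sampling, and all names are illustrative assumptions, not any cited paper's implementation.

```python
import numpy as np

def plane_sweep_cost_volume(feat_ref, feat_src, K, R, t, depths):
    """Minimal plane-sweep cost volume between a reference and one source
    view (a sketch of the cost-volume idea the excerpt mentions).

    feat_ref, feat_src: (C, H, W) feature maps
    K:                  (3, 3) shared intrinsics
    R, t:               rotation (3, 3) / translation (3,) mapping points
                        from the reference to the source camera frame
    depths:             (D,) depth hypotheses in the reference view
    """
    C, H, W = feat_ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                # (H, W)
    pix = np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)     # (3, H*W)
    rays = np.linalg.inv(K) @ pix                                 # per-pixel rays

    cost = np.zeros((len(depths), H, W), dtype=feat_ref.dtype)
    for i, d in enumerate(depths):
        X = rays * d                           # points on the depth-d plane
        Xs = R @ X + t[:, None]                # into the source frame
        uvw = K @ Xs
        us = (uvw[0] / np.clip(uvw[2], 1e-6, None)).round().astype(int)
        vs = (uvw[1] / np.clip(uvw[2], 1e-6, None)).round().astype(int)
        ok = (uvw[2] > 0) & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
        warped = np.zeros_like(feat_src.reshape(C, -1))
        warped[:, ok] = feat_src[:, vs[ok], us[ok]]
        # Dot-product matching score per pixel at this depth hypothesis
        # (higher means better photometric/feature agreement).
        cost[i] = (feat_ref.reshape(C, -1) * warped).sum(0).reshape(H, W)
    return cost
```

A depth-map network would then regularize this (D, H, W) volume, typically with 2D or 3D convolutions, and take a soft argmax over the depth axis.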
“…Recently, neural implicit representations have demonstrated promising results for object geometry representation [7, 18, 20, 28, 30-32, 36, 50, 54, 57, 58], scene completion [5,14,33], novel view synthesis [19,21,34,60], and also generative modelling [6,26,27,39]. A few recent papers [1,3,8,23,44] attempt to predict scene-level geometry from RGB(-D) inputs, but they all assume given camera poses. Another set of works [17,51,59] tackles the problem of camera pose optimization, but these methods require a rather long optimization process, which is not suitable for real-time applications.…”
Section: Related Work
confidence: 99%
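
At their core, the neural implicit representations this excerpt surveys are coordinate networks: an MLP mapping a 3D point to a scalar such as a signed distance. A tiny PyTorch sketch, with assumed (illustrative) depth and width:

```python
import torch
import torch.nn as nn

class ImplicitSDF(nn.Module):
    """Minimal coordinate MLP mapping 3D points to signed distance values,
    the basic form of a neural implicit surface representation (layer
    sizes are illustrative assumptions, not any cited paper's design)."""

    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):          # xyz: (N, 3) query points
        return self.net(xyz)         # (N, 1) signed distance
```

The surface is then the zero level set of this function, extractable with marching cubes over a grid of queries.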
“…Our method fuses input view features using a transformer. We compare to Atlas [28], which fuses features by averaging, and NeuralRecon [37], which fuses locally by averaging and globally by an RNN. Our method produces a high level of detail, while also filling in holes due to occlusion and unobserved regions.…”
Section: Introduction
confidence: 99%
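
For contrast with the transformer-based fusion, the averaging baseline this excerpt attributes to Atlas reduces to a masked running mean over views. A minimal sketch, assuming per-view volumes and visibility masks like those produced by the back-projection sketch above:

```python
import numpy as np

def fuse_by_averaging(view_volumes, view_masks):
    """Masked average of per-view feature volumes (a sketch of the
    averaging-style fusion described in the excerpt, not the paper's
    implementation).

    view_volumes: list of (C, nx, ny, nz) back-projected feature volumes
    view_masks:   list of (nx, ny, nz) booleans marking observed voxels
    """
    acc = np.zeros_like(view_volumes[0])
    count = np.zeros(view_volumes[0].shape[1:], dtype=np.float32)
    for vol, mask in zip(view_volumes, view_masks):
        acc += vol * mask            # only accumulate observed voxels
        count += mask
    return acc / np.clip(count, 1.0, None)   # unobserved voxels stay zero
```

Averaging treats all views equally, which is exactly the limitation that attention-based fusion (weighting each view per voxel) is meant to address.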
“…Recently, a number of works have addressed this by posing RGB-only 3D reconstruction as the direct prediction of a truncated signed-distance function (TSDF), using deep learning to fill in unobserved regions via learned priors [28,37]. These methods extract image features using a convolutional neural network (CNN), accumulate them in space by back-projecting onto a 3D grid, and then predict the TSDF volume using a 3D CNN.…”
Section: Introduction
confidence: 99%
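
The final stage of the pipeline described here — predicting a TSDF volume from the fused features with a 3D CNN — can be illustrated with a toy PyTorch head. The layer sizes and the tanh truncation are assumptions for the sketch, not the cited papers' designs.

```python
import torch
import torch.nn as nn

class TSDFHead(nn.Module):
    """Toy 3D CNN mapping a fused feature volume to a TSDF volume,
    illustrating the last stage of the pipeline described above."""

    def __init__(self, in_channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(16, 1, 1),
        )

    def forward(self, fused_volume):
        # fused_volume: (B, C, nx, ny, nz). Output lies in (-1, 1),
        # interpreted as a normalized truncated signed distance.
        return torch.tanh(self.net(fused_volume))

# Usage sketch: a batch of one 64^3 volume with 32 feature channels.
head = TSDFHead(in_channels=32)
tsdf = head(torch.randn(1, 32, 64, 64, 64))   # -> (1, 1, 64, 64, 64)
```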