Neural volumetric representations have shown that multi-layer perceptrons (MLPs) can be optimized with multi-view calibrated images to represent scene geometry and appearance without explicit 3D supervision. Object segmentation can enrich many downstream applications based on the learned radiance field. However, introducing hand-crafted segmentation to define regions of interest in a complex real-world scene is non-trivial and expensive, as it requires per-view annotation. This paper explores self-supervised learning for object segmentation using NeRF in complex real-world scenes. Our framework, called NeRF with Self-supervised Object Segmentation (NeRF-SOS), couples object segmentation and the neural radiance field to segment objects in any view within a scene. By proposing a novel collaborative contrastive loss at both the appearance and geometry levels, NeRF-SOS encourages NeRF models to distill compact, geometry-aware segmentation clusters from their density fields and from self-supervised pre-trained 2D visual features. The self-supervised object segmentation framework can be applied to various NeRF models, yielding both photo-realistic rendering results and convincing segmentation maps for indoor and outdoor scenarios. Extensive results on the LLFF, BlendedMVS, CO3Dv2, and Tanks & Temples datasets validate the effectiveness of NeRF-SOS. It consistently surpasses other 2D-based self-supervised baselines and predicts finer object masks than existing supervised counterparts. Please refer to the video on our project page for more details: https://zhiwenfan.github.io/NeRF-SOS/.

Recently, neural volumetric rendering techniques have shown great power in scene reconstruction. In particular, the neural radiance field (NeRF) and its variants (Mildenhall et al., 2020a; Barron et al., 2021) adopt multi-layer perceptrons (MLPs) to learn a continuous representation and utilize calibrated multi-view images to render unseen views with fine-grained details. Beyond rendering quality, the ability to understand scenes has been explored by several recent works (Vora et al., 2021; Yang et al., 2021; Zhi et al., 2021). Nevertheless, they either require dense view annotations to train a heavy 3D backbone for capturing semantic representations (Vora et al., 2021; Yang et al., 2021), or necessitate human intervention to provide sparse semantic labels (Zhi et al., 2021). Recent self-supervised object discovery approaches on neural radiance fields (Yu et al., 2021c; Stelzner et al., 2021) attempt to decompose objects from given scenes on synthetic indoor data. However, a gap still remains before such methods can be applied to complex real-world scenarios.
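To make the collaborative contrastive objective concrete, the sketch below illustrates one plausible form of it: soft segmentation affinities between sampled rays are pulled toward agreement with feature affinities at both the appearance level (features distilled from a self-supervised 2D backbone such as DINO) and the geometry level (features pooled from the NeRF density field). This is a minimal sketch, not the authors' exact formulation; the input features, the margin, and the affinity-based penalty are all assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def collaborative_contrastive_loss(app_feats, geo_feats, seg_logits, margin=0.5):
    """Hypothetical sketch of a collaborative contrastive loss.

    app_feats:  (N, C_a) per-ray appearance features, e.g. distilled from a
                self-supervised 2D backbone (assumption).
    geo_feats:  (N, C_g) per-ray geometry features, e.g. pooled from the NeRF
                density field along each ray (assumption).
    seg_logits: (N, K) per-ray logits from the segmentation head.
    """
    seg = F.softmax(seg_logits, dim=-1)   # soft cluster assignments
    seg_corr = seg @ seg.t()              # (N, N) segmentation affinity matrix

    def level_loss(feats):
        f = F.normalize(feats, dim=-1)
        feat_corr = f @ f.t()             # (N, N) cosine feature affinity
        # Rays whose features agree (affinity above the margin) should share a
        # cluster; rays whose features disagree should not.
        pos = (feat_corr - margin).clamp(min=0) * (1.0 - seg_corr)
        neg = (margin - feat_corr).clamp(min=0) * seg_corr
        return (pos + neg).mean()

    # Combine the appearance-level and geometry-level terms collaboratively.
    return level_loss(app_feats) + level_loss(geo_feats)
```

In this reading, the segmentation head never sees ground-truth masks: supervision comes entirely from the correlation structure of the 2D features and the density field, which is what lets the clusters stay consistent across views.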