Neural volumetric representations have shown that Multi-Layer Perceptrons (MLPs) can be optimized with multi-view calibrated images to represent scene geometry and appearance without explicit 3D supervision. Object segmentation can enrich many downstream applications based on the learned radiance field. However, introducing hand-crafted segmentation to define regions of interest in a complex real-world scene is non-trivial and expensive, as it requires per-view annotation. This paper explores self-supervised learning of object segmentation using NeRF for complex real-world scenes. Our framework, called NeRF with Self-supervised Object Segmentation (NeRF-SOS), couples object segmentation and a neural radiance field to segment objects in any view within a scene. By proposing a novel collaborative contrastive loss at both the appearance and geometry levels, NeRF-SOS encourages NeRF models to distill compact, geometry-aware segmentation clusters from their density fields and from self-supervised pre-trained 2D visual features. The self-supervised object segmentation framework can be applied to various NeRF models, yielding photo-realistic rendering results and convincing segmentation maps for both indoor and outdoor scenarios. Extensive results on the LLFF, BlendedMVS, CO3Dv2, and Tanks & Temples datasets validate the effectiveness of NeRF-SOS. It consistently surpasses other 2D-based self-supervised baselines and predicts finer object masks than existing supervised counterparts. Please refer to the video on our project page for more details: https://zhiwenfan.github.io/NeRF-SOS/.

Recently, neural volumetric rendering techniques have shown great power in scene reconstruction. In particular, the neural radiance field (NeRF) and its variants (Mildenhall et al., 2020a; Barron et al., 2021) adopt multi-layer perceptrons (MLPs) to learn a continuous representation and utilize calibrated multi-view images to render unseen views with fine-grained details. Beyond rendering quality, the ability of scene understanding has been explored by several recent works (Vora et al., 2021; Yang et al., 2021; Zhi et al., 2021). Nevertheless, they either require dense view annotations to train a heavy 3D backbone for capturing semantic representations (Vora et al., 2021; Yang et al., 2021), or necessitate human intervention to provide sparse semantic labels (Zhi et al., 2021). Recent self-supervised object discovery approaches on neural radiance fields (Yu et al., 2021c; Stelzner et al., 2021) try to decompose objects from given scenes on synthetic indoor data. However, a gap still remains before such methods can be applied to complex real-world scenarios.
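To make the collaborative contrastive objective described above more concrete, the sketch below illustrates one plausible correlation-distillation formulation: pixel pairs whose frozen teacher features (2D self-supervised appearance features or geometry features derived from the NeRF density field) are similar are encouraged to share segmentation features, while dissimilar pairs are pushed apart. This is only a minimal illustration under our own assumptions; the function names, the similarity shift, and the loss weights are hypothetical and do not reproduce the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def correlation(feats_a, feats_b):
    """Cosine-similarity correlation between two sets of per-pixel features.
    feats_*: (N, C) tensors."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    return a @ b.t()  # (N, N) pairwise similarities

def contrastive_correlation_loss(teacher_feats, seg_feats, shift=0.3):
    """Correlation-distillation term: pairs whose teacher similarity exceeds
    `shift` pull their segmentation features together; other pairs push apart.
    teacher_feats: frozen features (e.g. from a self-supervised 2D backbone,
    or geometry features from the radiance field) for sampled pixels.
    seg_feats: output of a small segmentation head for the same pixels."""
    with torch.no_grad():
        teacher_corr = correlation(teacher_feats, teacher_feats)
    seg_corr = correlation(seg_feats, seg_feats)
    # Negative sign: maximize segmentation agreement where the teacher agrees.
    return -((teacher_corr - shift) * seg_corr).mean()

def collaborative_loss(appearance_feats, geometry_feats, seg_feats,
                       w_app=1.0, w_geo=1.0):
    """Hypothetical combination of an appearance-level term (2D features)
    and a geometry-level term (density-derived features)."""
    loss_app = contrastive_correlation_loss(appearance_feats, seg_feats)
    loss_geo = contrastive_correlation_loss(geometry_feats, seg_feats)
    return w_app * loss_app + w_geo * loss_geo
```

In this sketch the segmentation head is trained jointly with (or on top of) the radiance field, so the clusters it produces remain consistent across rendered views; the specific weighting between the appearance and geometry terms is an assumed hyperparameter.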