2022 · Preprint
DOI: 10.48550/arxiv.2205.03923

Unsupervised Discovery and Composition of Object Light Fields

Abstract: Neural scene representations, both continuous and discrete, have recently emerged as a powerful new paradigm for 3D scene understanding. Recent efforts have tackled unsupervised discovery of object-centric neural scene representations. However, the high cost of ray-marching, exacerbated by the fact that each object representation has to be ray-marched separately, leads to insufficiently sampled radiance fields and thus noisy renderings, poor framerates, and high memory and time complexity during training and …
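
The cost asymmetry the abstract points to can be made concrete with a toy rendering sketch. The code below is illustrative only: `object_fields` and `object_lfns` are hypothetical per-object networks (a radiance field mapping 3D points to RGB and density, and a light-field network mapping a whole ray to a color), and the density-sum compositing and summed blend are common simplifications rather than the paper's actual composition operators.

```python
import torch

def render_composed_radiance_fields(rays_o, rays_d, object_fields, n_samples=128):
    """Ray-march K object radiance fields separately and composite them.
    Cost: K * n_samples network queries per ray."""
    t = torch.linspace(0.05, 1.0, n_samples)                      # sample depths
    pts = rays_o[:, None] + t[None, :, None] * rays_d[:, None]    # (R, S, 3) points
    rgbs, sigmas = zip(*(f(pts) for f in object_fields))          # K separate marches
    sigma = torch.stack(sigmas).sum(0)                            # (R, S) summed density
    rgb = (torch.stack(rgbs) * torch.stack(sigmas).unsqueeze(-1)).sum(0) / (
        sigma.unsqueeze(-1) + 1e-8)                               # density-weighted color
    alpha = 1 - torch.exp(-sigma * (t[1] - t[0]))                 # per-sample opacity
    T = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1 - alpha[:, :-1] + 1e-10], dim=-1),
        dim=-1)                                                   # transmittance
    return ((T * alpha).unsqueeze(-1) * rgb).sum(1)               # (R, 3) pixel colors

def render_composed_light_fields(rays, object_lfns):
    """Query K object light-field networks once per ray: no marching.
    The summed blend is a placeholder, not the paper's operator."""
    return torch.stack([lfn(rays) for lfn in object_lfns]).sum(0) # (R, 3)
```

For K objects, R rays, and S depth samples, the radiance-field path needs K·R·S network queries against K·R for the light-field path, which is exactly the per-object ray-marching overhead the abstract identifies.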

Cited by 3 publications (4 citation statements) · References 38 publications
“…The predominant way to identify the objects present in a scene is to segment two-dimensional images using extensive manual annotation (Kirillov et al., 2023; Wang et al., 2023a), but relying on human supervision introduces challenges and scales poorly to 3D data. As an alternative, an extensive line of work on unsupervised object discovery (Russell et al., 2006; Rubinstein et al., 2013; Oktay et al., 2018; Hénaff et al., 2022; Smith et al., 2022; Ye et al., 2022; Monnier et al., 2023) proposes different inductive biases (Locatello et al., 2019) that encourage awareness of objects in a scene. However, these approaches are largely restricted to either 2D images or constrained 3D data (Yu et al., 2021; Sajjadi et al., 2022), limiting their applicability to complex 3D scenes. …”
Section: Related Work
confidence: 99%
“…Prior work has aimed to infer object-centric representations directly from images, with objects either represented as localized object-centric patches [52-56] or scene mixture components [2, 57-61], with the slot attention module [1] increasingly driving object-centric inference. Resulting object representations may be decoded into object-centric 3D representations and composed for novel view synthesis [4, 6, 62-68]. BlockGAN and GIRAFFE [69, 70] build unconditional generative models for compositions of 3D-structured representations, but only tackle generation, not reconstruction. …”
Section: Related Work
confidence: 99%
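
The excerpt above names the slot attention module [1] (Locatello et al., 2020) as the mechanism increasingly driving object-centric inference. The following is a minimal sketch of that module; the feature dimension, slot count, and iteration count are illustrative defaults rather than values from any cited paper.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Slot Attention sketch: N input features compete for K slots via
    attention that is normalized over slots, not over inputs."""
    def __init__(self, dim, n_slots=5, n_iters=3):
        super().__init__()
        self.n_slots, self.n_iters, self.scale = n_slots, n_iters, dim ** -0.5
        self.mu = nn.Parameter(torch.zeros(1, 1, dim))         # slot init mean
        self.log_sigma = nn.Parameter(torch.zeros(1, 1, dim))  # slot init spread
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_in, self.norm_slot, self.norm_mlp = (
            nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim))

    def forward(self, x):                                      # x: (B, N, dim)
        B, N, D = x.shape
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.mu + self.log_sigma.exp() * torch.randn(B, self.n_slots, D)
        for _ in range(self.n_iters):
            q = self.to_q(self.norm_slot(slots))
            attn = (torch.einsum('bkd,bnd->bkn', q, k) * self.scale).softmax(dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum('bkn,bnd->bkd', attn, v)    # weighted mean
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).view(B, self.n_slots, D)
            slots = slots + self.mlp(self.norm_mlp(slots))     # residual MLP
        return slots                                           # (B, K, dim) slots
```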
“…These methods segment an image or video into non-overlapping objects and infer a latent code for each of them; however, they are either constrained to simple toy environments or require video with additional annotations, such as bounding boxes, at test time. The resulting object-centric latent codes can be decoded into object-centric 3D scene representations [4-6]. Here, 3D notions, such as 3D scale and connectedness, can serve as an additional training signal, but methods are similarly limited to simple scenes. …”
Section: Introduction
confidence: 99%
“…In this line, the focus has been to design an appropriate decoder that supports good decomposition. The most widely used decoders include the mixture decoder [6, 25, 47, 26, 19, 69, 1, 17, 18, 36, 75], the spatial transformer decoder [20, 12, 44, 34, 13, 9], and Neural Radiance Fields (NeRF) [68, 73, 65], …”
confidence: 99%
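
Of the decoder families the excerpt lists, the mixture decoder is the most widely used; a minimal sketch of the idea follows. The function name and tensor shapes are illustrative assumptions, not an API from any of the cited works.

```python
import torch

def mixture_decode(slot_rgbs, slot_mask_logits):
    """Mixture decoder sketch: each slot decodes its own RGB image plus a
    mask logit map; a softmax across slots yields a pixel-wise mixture, so
    each pixel is explained by (mostly) one object slot.

    slot_rgbs:        (K, 3, H, W) per-slot RGB reconstructions
    slot_mask_logits: (K, 1, H, W) per-slot unnormalized masks
    """
    masks = torch.softmax(slot_mask_logits, dim=0)  # pixels compete across slots
    image = (masks * slot_rgbs).sum(dim=0)          # (3, H, W) composited image
    return image, masks
```

This per-pixel competition is what lets such a decoder "support good decomposition": reconstruction improves only when each slot specializes to a coherent region.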