2021
DOI: 10.48550/arxiv.2112.11691
Preprint

CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes

Abstract: 3D scene understanding is a relatively emerging research field. In this paper, we introduce the Visual Question Answering task in 3D real-world scenes (VQA-3D), which aims to answer all possible questions given a 3D scene. To tackle this problem, the first VQA-3D dataset, namely CLEVR3D, is proposed, which contains 60K questions in 1,129 real-world scenes. Specifically, we develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions, covering the questions of objects' …

Cited by 2 publications (2 citation statements)
References 29 publications
“…is the main problem of 3D-QA. Recently, some 3D-QA datasets [1,2,8,9] have emerged at the same time. We choose ScanQA [2] as our research object since the other two are not open source yet.…”
Section: D Visual Question Answering (mentioning)
confidence: 99%
“…Traditional 3D scenario understanding tasks mostly focus on individual targets and overlook target relations. Literature [43] introduced visual question answering in real 3D scenarios, aiming to answer all possible questions given a 3D scenario. They designed TransVQA3D, which first uses a cross-modal Transformer to fuse question and target characteristics.…”
Section: Related Work (mentioning)
confidence: 99%