InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring

Yuan, Zhihao; Xu, Yan; Liao, Yinghong; Zhang, Ruimao; Li, Zhen; Cui, Shuguang

doi:10.1109/iccv48922.2021.00181

Cited by 80 publications

(56 citation statements)

References 25 publications


“…3D visual grounding Tab. 2 compares our results against prior 3D visual grounding methods ScanRefer [6], TGNN [25], InstanceRefer [62] and 3DVG-Transformer [64], and 3DVG-Trans+, an unpublished extension. Our method trained only with the detection loss and the listener loss (marked "Ours w/o fine-tuning"), i.e.…”

Section: Quantitative Results (mentioning)

confidence: 99%

“…ScanRefer proposes the joint task of detecting and localizing objects in a 3D scan based on a textual description, while ReferIt3D is focused on distinguishing 3D objects from the same semantic class given ground-truth bounding boxes. Yuan et al [62] localize objects by decomposing input queries into fine-grained aspects, and used PointGroup [27] as their visual backbone. However, they used pre-computed instance predictions, so the detection backbone is not fine-tuned together with the localization module.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, there has been increasing interest in bridging 3D visual scene understanding [5,13,19,20,24,44,49] and natural language processing [4,14,36,52,59]. The task of 3D visual grounding [6,62,64] localizes 3D objects described by natural language queries. 3D dense captioning proposed by Chen et al [7] is the reverse task where we generate descriptions for 3D objects in RGB-D scans.…”

Section: Introduction (mentioning)

confidence: 99%


D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

Chen, Wu, Nießner et al. 2021. Preprint.


Figure 1. We introduce D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate. D3Net also enables semi-supervised training on ScanNet data with partially annotated descriptions.


“…Object classification module predicts what objects are associated with a question. Note that many questions do not contain target object names related to the answer in contrast to the 3D localization task [10,50,51]. We use the 3D and question-aware fused feature f and feed it into a two-layer MLP to predict 18 ScanNet benchmark classes.…”

Section: ScanQA Model (mentioning)

confidence: 99%

ScanQA: 3D Question Answering for Spatial Scene Understanding

Azuma, Miyanishi, Kurita et al. 2021. Preprint.


We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA). In the 3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D indoor scan and answer the given textual questions about the 3D scene. Unlike the 2D-question answering of VQA, the conventional 2D-QA models suffer from problems with spatial understanding of object alignment and directions and fail the object localization from the textual questions in 3D-QA. We propose a baseline model for 3D-QA, named ScanQA model, where the model learns a fused descriptor from 3D object proposals and encoded sentence embeddings. This learned descriptor correlates the language expressions with the underlying geometric features of the 3D scan and facilitates the regression of 3D bounding boxes to determine described objects in textual questions. We collected human-edited question-answer pairs with free-form answers that are grounded to 3D objects in each 3D scene. Our new ScanQA dataset contains over 41K question-answer pairs from the 800 indoor scenes drawn from the ScanNet dataset. To the best of our knowledge, ScanQA is the first large-scale effort to perform object-grounded question-answering in 3D environments.

Example question-answer pairs from the figure: Q. Where is the medium sized blue suitcase laid? A. in front of right bed. Q. What is sitting on the floor between the tv and the wooden chair? A. 2 black backpacks.
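The citation statement quoted above mentions feeding a fused feature f into a two-layer MLP that predicts 18 ScanNet benchmark classes. The following is a minimal sketch of such a classification head, not the ScanQA authors' implementation. The feature and hidden sizes, the ReLU activation, the softmax output, the random initialization, and all function names are assumptions made for illustration only.

```python
import math
import random

NUM_CLASSES = 18  # number of ScanNet benchmark classes, per the quoted statement


def init_linear(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, layer):
    """Apply a dense layer: one dot product plus bias per output unit."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused_feature, layer1, layer2):
    """Two-layer MLP: fused feature -> hidden (ReLU, assumed) -> class probabilities."""
    hidden = relu(linear(fused_feature, layer1))
    return softmax(linear(hidden, layer2))


rng = random.Random(0)
FEATURE_DIM, HIDDEN_DIM = 256, 128  # hypothetical sizes, not from the source
layer1 = init_linear(FEATURE_DIM, HIDDEN_DIM, rng)
layer2 = init_linear(HIDDEN_DIM, NUM_CLASSES, rng)

# Random stand-in for the 3D and question-aware fused feature f.
f = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
probs = classify(f, layer1, layer2)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Here `probs` holds one probability per class and `predicted_class` is the index of the most likely class. With untrained random weights the prediction carries no meaning; the sketch only shows the shape of the computation.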


“…Language and Shape Works that explore the intersection between language and geometry have taken many forms, from resolving language references [2,3,36], to generating language descriptions of a shape [3,19], to generating a shape given a language description [22,34]. Most relevant to our work are the ones that attempt the language reference game, where the task is to select based on a language description a target shape out of a set of potential candidates either in a collection of individual 3D shapes [3,36] or within a scene [2,20,33,40,43,45]. While most of these works treat the reference game as a classification problem on the set of candidates, [20] outputs a segmentation mask over the scene.…”

Section: Related Work (mentioning)

confidence: 99%

PartGlot: Learning Shape Part Segmentation from Language Reference Games

Koo, Huang, Achlioptas et al. 2021. Preprint.


Figure 1. Overview. (a) Referential Language for Shapes. (b) Language-guided 3D Part Segmentations via Neural Attention. On the left panel, we present examples of referential language distinguishing the shape of a "target" geometry (enclosed inside a green box) from two "distractor" objects. Using such language our proposed task is to estimate directly in 3D-space semantic part segmentations of objects. On the right panel, we present the key ingredients of the neural architecture to facilitate this goal: given referential language and unsupervised 3D super-segments of shapes, we learn a set of attention maps that corresponds to semantic shape parts (when properly regularized), discovered solely by solving the language-reference problem of identifying the target shape. Tapping on the zero-shot learning capacity of natural language learners, and the shared part-composition of common objects, we find examples of zero-shot segmentations on a table and lamp objects, extracted from learners and language concerning only chair-based comparisons.
