2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00181
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring

Cited by 80 publications (56 citation statements) · References 25 publications
“…3D visual grounding: Tab. 2 compares our results against prior 3D visual grounding methods ScanRefer [6], TGNN [25], InstanceRefer [62], and 3DVG-Transformer [64], as well as 3DVG-Trans+, an unpublished extension. Our method trained only with the detection loss and the listener loss (marked "Ours w/o fine-tuning"), i.e.…”
Section: Quantitative Results
confidence: 99%
“…ScanRefer proposes the joint task of detecting and localizing objects in a 3D scan based on a textual description, while ReferIt3D focuses on distinguishing 3D objects from the same semantic class given ground-truth bounding boxes. Yuan et al [62] localize objects by decomposing input queries into fine-grained aspects and use PointGroup [27] as their visual backbone. However, they use pre-computed instance predictions, so the detection backbone is not fine-tuned together with the localization module.…”
Section: Related Work
confidence: 99%
“…The object classification module predicts which objects are associated with a question. Note that many questions do not contain target object names related to the answer, in contrast to the 3D localization task [10,50,51]. We use the 3D and question-aware fused feature f and feed it into a two-layer MLP to predict 18 ScanNet benchmark classes.…”
Section: ScanQA Model
confidence: 99%
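The classification head quoted above — a fused feature f passed through a two-layer MLP over the 18 ScanNet benchmark classes — can be sketched as a plain forward pass. This is a minimal illustration, not the authors' implementation: the feature dimension (256) and hidden width (128) are hypothetical, and the weights here are random placeholders.

```python
import numpy as np

def two_layer_mlp(f, W1, b1, W2, b2):
    """Forward pass of a two-layer MLP classification head.

    f  : fused 3D-and-question feature vector, shape (d,)
    W2 : maps the hidden layer to the 18 ScanNet benchmark
         classes mentioned in the quoted text.
    Returns class probabilities, shape (n_classes,).
    """
    h = np.maximum(0.0, f @ W1 + b1)      # hidden layer with ReLU
    logits = h @ W2 + b2                  # one logit per class
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Hypothetical sizes: 256-d fused feature, 128 hidden units, 18 classes.
rng = np.random.default_rng(0)
d, hidden, n_classes = 256, 128, 18
probs = two_layer_mlp(
    rng.standard_normal(d),
    rng.standard_normal((d, hidden)) * 0.02, np.zeros(hidden),
    rng.standard_normal((hidden, n_classes)) * 0.02, np.zeros(n_classes),
)
```

In training, the softmax output would be paired with a cross-entropy loss over the 18 class labels; only the two linear layers and the choice of ReLU follow from the quoted description.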
“…Language and Shape: Works that explore the intersection between language and geometry have taken many forms, from resolving language references [2,3,36], to generating language descriptions of a shape [3,19], to generating a shape given a language description [22,34]. Most relevant to our work are those that attempt the language reference game, where the task is to select, based on a language description, a target shape out of a set of potential candidates, either in a collection of individual 3D shapes [3,36] or within a scene [2,20,33,40,43,45]. While most of these works treat the reference game as a classification problem on the set of candidates, [20] outputs a segmentation mask over the scene.…”
Section: Related Work
confidence: 99%