2022
DOI: 10.1007/978-3-031-20059-5_24

Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds

Cited by 42 publications (21 citation statements)
References 38 publications
“…For example, Text-guided Graph Neural Network [17] conducts instance segmentation on the full scene to create candidate objects as input to a graph neural network [32]; InstanceRefer [39] selects instance candidates from the panoptic segmentation of point clouds; 3DVG-Transformer [40] uses outputs from an object proposal generation module to fully leverage contextual clues for cross-modal proposal disambiguation. The best performing work in this category, BUTD-DETR [20], uses box proposals from a pretrained detector and scene features from the full 3D scene to decode objects with a detection head. The Multi-View Transformer [18] separately models the scene by projecting the 3D scene to a multi-view space, to eliminate dependence on specific views and learn robust representations.…”
Section: Related Work
Mentioning confidence: 99%
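As a rough illustration of the pipeline described in the quoted passage, the following is a minimal, hypothetical sketch of a BUTD-DETR-style grounding decoder: box proposals from a pretrained detector (bottom-up cues) and full-scene point features are fused with language tokens (top-down cues), and a set of queries is decoded into boxes by a detection head. All module names, dimensions, and the fusion strategy here are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a BUTD-DETR-style grounding decoder (not the authors' code).
# Detector box proposals, full-scene features, and language tokens are concatenated
# into one memory sequence; object queries cross-attend to it and a detection head
# predicts boxes plus a grounding score per query.
import torch
import torch.nn as nn


class GroundingDecoderSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=64, num_layers=6):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)   # learned object queries
        self.box_proj = nn.Linear(6, d_model)                # embed detector proposals (x, y, z, w, h, d)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(d_model, 6)                # detection head: box parameters
        self.score_head = nn.Linear(d_model, 1)              # grounding / confidence score

    def forward(self, scene_feats, box_proposals, text_feats):
        # scene_feats:   (B, N_pts, d)  encoded full-scene point features
        # box_proposals: (B, N_box, 6)  boxes from a pretrained detector (bottom-up stream)
        # text_feats:    (B, N_tok, d)  encoded language tokens (top-down stream)
        memory = torch.cat([scene_feats, self.box_proj(box_proposals), text_feats], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(scene_feats.size(0), -1, -1)
        decoded = self.decoder(q, memory)                    # queries cross-attend to all streams
        return self.box_head(decoded), self.score_head(decoded).squeeze(-1)
```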
“…Modules in NS3D can be trained end-to-end with only the groundtruth referred objects as supervision; each can also be trained individually whenever additional labels are available. In this paper, we use a hybrid training objective similar to prior works [2,20]. Specifically, we use the groundtruth object category to compute a per-object classification loss L_oce (applied to all prob_c, where c is the category) and the groundtruth final target object to compute a per-expression loss L_tce.…”
Section: Training
Mentioning confidence: 99%
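To make the hybrid objective in the quoted passage concrete, here is a minimal, hypothetical sketch that combines the per-object classification loss (L_oce) and the per-expression loss (L_tce) as plain cross-entropy terms. Tensor shapes, loss weights, and the function name are assumptions, not the cited papers' implementation.

```python
# Hypothetical sketch of the hybrid objective described above (assumed details):
# L_oce is a per-object classification loss over groundtruth categories, and
# L_tce is a per-expression loss selecting the referred object among candidates.
import torch
import torch.nn.functional as F


def hybrid_grounding_loss(obj_logits, obj_labels, ref_scores, target_idx,
                          w_oce=1.0, w_tce=1.0):
    # obj_logits: (B, N_obj, C)  per-object category logits
    # obj_labels: (B, N_obj)     groundtruth category index for each object
    # ref_scores: (B, N_obj)     per-expression score for each candidate object
    # target_idx: (B,)           index of the groundtruth referred object
    l_oce = F.cross_entropy(obj_logits.flatten(0, 1), obj_labels.flatten())
    l_tce = F.cross_entropy(ref_scores, target_idx)
    return w_oce * l_oce + w_tce * l_tce


# Example with random tensors (2 scenes, 8 candidate objects, 20 categories):
loss = hybrid_grounding_loss(torch.randn(2, 8, 20), torch.randint(20, (2, 8)),
                             torch.randn(2, 8), torch.randint(8, (2,)))
```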