2021
DOI: 10.48550/arxiv.2112.01551
Preprint

D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

Abstract: Figure 1 (example detections with dense captions, e.g. "this is a black chair. it is against the wall and facing the door." and "this chair is black. it is under the whiteboard facing the round table."). We introduce D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate. D3Net also enables semi-supervised training on ScanNet data with partially annotated descriptions.

Cited by 0 publications
References 46 publications
(124 reference statements)