Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475397
TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Cited by 45 publications (17 citation statements) | References 18 publications
“…However, the sparse, noisy, and limited semantic information of point clouds compared to 2D images makes it difficult to accurately locate a referred object [36]. Additionally, the proximity of the referent to adjacent objects in the scene can lead to localization errors [2,1,4,40], and view-dependent descriptions can result in poor localization performance when the referent is described in spatial terms [15,36,40,14,13,39,12]. There are also localization errors when locating a unique referent among multiple visually similar objects [2,21,15,36,40,14,13,39,12,19,1,4].…”
Section: Non-interactive Visual Grounding (mentioning, confidence: 99%)
“…Additionally, the proximity of the referent to adjacent objects in the scene can lead to localization errors [2,1,4,40], and view-dependent descriptions can result in poor localization performance when the referent is described in spatial terms [15,36,40,14,13,39,12]. There are also localization errors when locating a unique referent among multiple visually similar objects [2,21,15,36,40,14,13,39,12,19,1,4]. Our approach introduces a new task of 3D visual grounding in a human-in-the-loop scenario, where body gestures are integrated into the scene to mitigate localization errors resulting from sparse, noisy, and semantically limited point clouds, object proximity, difficulty in distinguishing a unique referent among visually similar objects, and view-dependent descriptions.…”
Section: Non-interactive Visual Grounding (mentioning, confidence: 99%)
“…ReferIt3DNet [2] utilizes a graph convolutional network with input objects as nodes of the graph. 3DRefTransformer [1], LanguageRefer [30], TransRefer [16], and SAT [37] are Transformer-based methods that operate on language and 3D object point clouds. 3DRefTransformer [1] is an end-to-end Transformer model that incorporates an object pairwise spatial relation loss.…”
Section: Related Work (mentioning, confidence: 99%)
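For readers unfamiliar with this family of models, the sketch below illustrates the shared pattern the excerpt describes: per-object point-cloud features and language tokens are fused by a Transformer encoder, and each object token is scored as the candidate referent. This is a minimal PyTorch illustration, not code from any of the cited papers; the feature dimensions, module names, and vocabulary size are all assumptions.

```python
# Minimal sketch of a Transformer-based 3D grounding model (illustrative only).
import torch
import torch.nn as nn

class GroundingTransformerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, vocab_size=30522):
        super().__init__()
        # Project pooled per-object point-cloud features (e.g. a PointNet++
        # output; the 1024-dim input size is an assumption) into a shared space.
        self.obj_proj = nn.Linear(1024, d_model)
        self.word_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score = nn.Linear(d_model, 1)  # one referent logit per object token

    def forward(self, obj_feats, token_ids):
        # obj_feats: (B, N_obj, 1024) pooled object features; token_ids: (B, L)
        tokens = torch.cat([self.obj_proj(obj_feats),
                            self.word_emb(token_ids)], dim=1)
        fused = self.encoder(tokens)
        # Score only the object positions of the fused sequence: (B, N_obj)
        return self.score(fused[:, :obj_feats.size(1)]).squeeze(-1)
```

At training time, a cross-entropy loss over the per-object logits against the ground-truth referent index would drive grounding; the pairwise spatial relation loss mentioned for 3DRefTransformer would be an additional auxiliary term on top of this.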
“…LanguageRefer [30] uses a Transformer architecture over bounding box embeddings and language embeddings from DistilBert [31]. TransRefer [16] utilizes a Transformer-based network to extract entity-and-relation-aware representations…”
Section: Related Work (mentioning, confidence: 99%)
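Similarly, a hedged sketch of the LanguageRefer-style input described above: DistilBert token features are concatenated with learned embeddings of each candidate object's 3D bounding box before Transformer fusion. The class name, the 6-value box parameterization, and the fusion depth are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch: fusing DistilBert language features with box embeddings.
import torch
import torch.nn as nn
from transformers import DistilBertModel

class BoxLanguageFusionSketch(nn.Module):
    def __init__(self, n_heads=8, n_layers=2):
        super().__init__()
        self.text_enc = DistilBertModel.from_pretrained("distilbert-base-uncased")
        d_model = self.text_enc.config.dim  # 768 for distilbert-base
        # Embed each candidate's axis-aligned box as (cx, cy, cz, w, h, d);
        # this 6-value parameterization is an assumption for the sketch.
        self.box_proj = nn.Linear(6, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, input_ids, attention_mask, boxes):
        # input_ids / attention_mask: DistilBert-tokenized utterance, (B, L)
        # boxes: (B, N_obj, 6) candidate bounding boxes
        words = self.text_enc(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        fused = self.fuse(torch.cat([self.box_proj(boxes), words], dim=1))
        return self.head(fused[:, :boxes.size(1)]).squeeze(-1)  # (B, N_obj)
```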