2022
DOI: 10.48550/arxiv.2205.02671
Preprint

What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning

Abstract: Understanding spatial relations is essential for intelligent agents to act and communicate in the physical world. Relative directions are spatial relations that describe the relative positions of target objects with regard to the intrinsic orientation of reference objects. Grounding relative directions is more difficult than grounding absolute directions because it not only requires a model to detect objects in the image and to identify spatial relations based on this information, but it also needs to recognize…
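To see why the intrinsic orientation matters, here is a minimal geometric sketch (hypothetical function and variable names, not the paper's model or dataset code): rotating the target's offset into the reference object's frame makes the same scene yield different direction labels as the reference heading changes, which is exactly the information an absolute-direction model can ignore.

```python
import math

# Hypothetical helper: classify the direction of a target object relative to a
# reference object's intrinsic orientation. The names and the 4-way scheme
# (left / right / in front of / behind) are illustrative assumptions.

def relative_direction(target_xy, reference_xy, reference_heading_rad):
    """Label the target's position in the reference object's intrinsic frame."""
    dx = target_xy[0] - reference_xy[0]
    dy = target_xy[1] - reference_xy[1]
    # Rotate the offset by -heading so that +x points the way the reference
    # object is facing and +y points to its left.
    cos_h = math.cos(-reference_heading_rad)
    sin_h = math.sin(-reference_heading_rad)
    fx = dx * cos_h - dy * sin_h  # forward component
    fy = dx * sin_h + dy * cos_h  # leftward component
    if abs(fx) >= abs(fy):
        return "in front of" if fx > 0 else "behind"
    return "left" if fy > 0 else "right"

# The same world-frame scene yields different answers as the reference turns:
print(relative_direction((1, 0), (0, 0), 0.0))          # 'in front of'
print(relative_direction((1, 0), (0, 0), math.pi))      # 'behind'
print(relative_direction((1, 0), (0, 0), math.pi / 2))  # 'right'
```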

Cited by 1 publication (3 citation statements) | References 7 publications
“…For example, Kazemzadeh et al. (2014) developed a dataset of referring expressions to ground objects in photographs of natural scenes. Lee et al. (2022) curated another dataset to develop a model for comprehending referring expressions through visual question answering. One of the crucial limitations of these datasets is that the data samples are curated in non-embodied settings, where human presence and nonverbal gestures are not considered in referring expressions.…”
Section: Related Work (mentioning, confidence: 99%)
“…Multimodal Representation Learning: Several multimodal representation learning models have been proposed for various tasks, such as human activity recognition (Iqbal 2020, 2021; Samyoun* et al. 2022; Islam, Yasar, and Iqbal 2022; Feichtenhofer et al. 2019), motion prediction (Yasar*, Islam*, and Iqbal 2022; Yasar and Iqbal 2021), visual question answering (Lu et al. 2019; Li et al. 2019), and referring expression comprehension (Yu et al. 2016; Mao et al. 2016). Existing models for REF predominantly use similar cross-attention or self-attention methods to fuse multimodal representations (Goyal et al. 2020; Lee et al. 2022). However, due to a lack of perspective diversity and nonverbal gestures, these models do not explicitly learn the perspective taking and understanding of human nonverbal interaction necessary to comprehend REF.…”
Section: Related Work (mentioning, confidence: 99%)
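As a concrete illustration of the cross-attention fusion pattern this statement refers to, a minimal sketch in PyTorch (layer sizes, class names, and the text-attends-to-vision direction are illustrative assumptions, not the cited models' actual architectures):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of one cross-attention fusion layer: text tokens attend over
    visual region features, followed by a residual connection and layer norm."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, visual_feats):
        # Queries come from the text; keys and values come from the image.
        attended, _ = self.cross_attn(text_feats, visual_feats, visual_feats)
        return self.norm(text_feats + attended)

fusion = CrossAttentionFusion()
text = torch.randn(2, 12, 256)     # batch of 12-token questions
regions = torch.randn(2, 36, 256)  # batch of 36 detected region features
fused = fusion(text, regions)
print(fused.shape)                 # torch.Size([2, 12, 256])
```

Text querying visual features is one common choice; models of this kind may also fuse in the opposite direction or stack several such layers.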