2022
DOI: 10.48550/arxiv.2210.05714
Preprint

Visual Language Maps for Robot Navigation

Abstract: Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions). While this is useful for matching images to natural language descriptions of object goals, it remains disjoint from the process of mapping the environment, so it lacks the spatial precision of classic geometric maps. To address this problem, we propose VLMaps, a spatial map representation that directly fuses pretrained visu…
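The abstract's core idea — fusing pretrained visual-language features directly into a spatial map, then querying that map with language — can be illustrated with a toy sketch. This is not the authors' implementation: the grid size, feature dimension, averaging scheme, and function names below are all hypothetical, standing in for the real pixel-aligned embeddings and projection pipeline.

```python
import numpy as np

def build_vlmap(points_xy, pixel_features, grid_size=32, feat_dim=4):
    """Fuse per-point visual-language features into a 2D grid by
    averaging the features of all points that fall into each cell."""
    vlmap = np.zeros((grid_size, grid_size, feat_dim))
    counts = np.zeros((grid_size, grid_size))
    for (x, y), feat in zip(points_xy, pixel_features):
        i, j = int(x), int(y)  # assume points are already in grid coordinates
        vlmap[i, j] += feat
        counts[i, j] += 1
    nonzero = counts > 0
    vlmap[nonzero] /= counts[nonzero][:, None]  # mean feature per occupied cell
    return vlmap

def query_map(vlmap, text_embedding):
    """Score every map cell against a text embedding via cosine similarity,
    so the best-matching cell localizes the described object."""
    norms = np.linalg.norm(vlmap, axis=-1) * np.linalg.norm(text_embedding)
    scores = vlmap @ text_embedding
    return np.divide(scores, norms, out=np.zeros_like(scores), where=norms > 0)
```

A query then reduces to embedding a phrase like "the chair" with the same visual-language model and taking the argmax of the similarity map, which is what gives the representation the spatial precision a caption-only matcher lacks.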


Cited by 7 publications (10 citation statements)
References 37 publications
“…Correspondingly, the encoder-decoder network needs to be modified such that it is suitable for dealing with these modified inputs and outputs (see details in Supplementary Note 1). We represent the subjects with 19-point 3D skeletons and model the indoor environment with a 3D visual-semantic map such that a given coordinate (e.g., (1.3 m, 1.2 m, 0.8 m)) can be associated with a semantic coordinate (e.g., "at the left side of the chair") 58 (see Fig. 6a and Supplementary Note 10).…”
Section: Experimental Results for 4D Compressive Microwave Metaimaging
confidence: 99%
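The statement above associates a metric coordinate such as (1.3 m, 1.2 m, 0.8 m) with a semantic coordinate such as "at the left side of the chair". A minimal sketch of that association is a nearest-landmark lookup; the landmark table, the left/right rule, and the function name here are invented for illustration and are not taken from the cited work.

```python
import math

# Hypothetical object landmarks: name -> (x, y, z) position in metres.
LANDMARKS = {"chair": (1.5, 1.2, 0.5), "table": (3.0, 0.8, 0.7)}

def semantic_coordinate(x, y, z, landmarks=LANDMARKS):
    """Map a metric coordinate to a coarse semantic description by
    finding the nearest landmark and its relative direction."""
    name, (ox, oy, oz) = min(
        landmarks.items(),
        key=lambda kv: math.dist((x, y, z), kv[1]),
    )
    side = "left" if x < ox else "right"
    return f"at the {side} side of the {name}"

print(semantic_coordinate(1.3, 1.2, 0.8))  # at the left side of the chair
```

A real visual-semantic map would derive the landmark positions from perception rather than a hand-written table, but the lookup step has this shape.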
“…A handful of contemporary studies, however, have begun to explore the use of generative models for navigation. Shah et al (Shah et al 2023) employ GPT-3 (Brown et al 2020) in an attempt to identify "landmarks" or subgoals, while Huang et al (Huang et al 2022) concentrate on using an LLM to generate code. Zhou et al (Zhou et al 2023) use an LLM to extract commonsense knowledge of the relations between targets and objects in observations to perform zero-shot object navigation (ZSON) (Gadre et al 2022; Majumdar et al 2022).…”
Section: LLMs in Robotics Navigation
confidence: 99%
“…Large language models have been explored as an approach to high-level planning [14]- [18] and scene understanding [19], [20]. Vision-language models embedding image features into the same space as text have been applied to open vocabulary object detection [16], [17], natural language maps [15], [17], [21]- [23], and for language-informed navigation [24]- [26].…”
Section: Language Models in Robotics
confidence: 99%
“…Such methods decompose the scene into a graph where edges model relations between parts of the scene. The geometry of the parts is typically represented as a signed distance function stored in a voxel grid [15].…”
Section: Semantic Scene Representations
confidence: 99%
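The signed-distance-function-in-a-voxel-grid representation mentioned in the last statement can be sketched in a few lines, assuming a single sphere as the geometry; the grid resolution, cell size, and function name are illustrative only. Each voxel stores the signed distance from its centre to the nearest surface: negative inside the object, positive outside, with the zero level set tracing the surface itself.

```python
import numpy as np

def sphere_sdf_grid(center, radius, grid_size=8, cell=1.0):
    """Store a signed distance function in a voxel grid: each cell
    holds the signed distance from its centre to a sphere's surface
    (negative inside, positive outside)."""
    coords = (np.arange(grid_size) + 0.5) * cell  # voxel-centre coordinates
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    dist = np.sqrt((x - center[0])**2 + (y - center[1])**2 + (z - center[2])**2)
    return dist - radius

sdf = sphere_sdf_grid(center=(4.0, 4.0, 4.0), radius=2.0)
# Voxels with sdf < 0 lie inside the sphere; the zero level set is the surface.
```

Scene-graph methods of the kind cited attach one such grid (or a fused one) to each part of the scene, so geometric queries like collision checks reduce to trilinear lookups into the grid.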