LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

Shah, Dhruv; Osiński, Błażej; Ichter, Brian; Levine, Sergey

doi:10.48550/arxiv.2207.04429

Cited by 7 publications

(16 citation statements)

References 39 publications

“…The recent success of large pretrained vision and language models [10], [27] has spurred a flurry of interest in applying their zero-shot capabilities to different domains including object detection and segmentation [28], [29], [11], robot manipulation [30], [31], [32], [33], and navigation [13], [12], [34]. Most related to our work is the approach denoted LM-Nav [13], which combines three pre-trained models to navigate via a topological graph in the real world. CoW [12] performs zero-shot language-based object navigation by combining CLIP-based [10] saliency maps and traditional exploration methods.…”

Section: Related Work (mentioning)

confidence: 99%

“…CoW [12] performs zero-shot language-based object navigation by combining CLIP-based [10] saliency maps and traditional exploration methods. However, both LM-Nav [13] and CoW [12] are limited to navigating to object landmarks and are less capable to understand finer-grained queries, such as "to the left of the chair" and "in between the TV and the sofa". In contrast, our method enables spatial language indexing beyond object-centric goals and can generate open-vocabulary obstacle maps.…”

Section: Related Work (mentioning)

confidence: 99%

“…In this section, we describe our approach to long-horizon (spatial) goal navigation, given a set of landmark descriptions specified by natural language instructions such as "move first to the left side of the counter, then move between the sink and the oven, then move back and forth to the sofa and the table twice." Notably different from prior work [12], [13], VLMaps allow us to reference precise spatial goals such as: "in between the sofa and the TV" or "three meters to the east of the chair." Specifically, we use a large language model (LLM) to interpret the input natural language commands and break them down into subgoals [35], [13], [14].…”

Section: Zero-shot Spatial Goal Navigation From Language (mentioning)

confidence: 99%

“…Meanwhile, recent works show that visual-language models (VLMs) [10], [11] pretrained on Internet-scale data (e.g., image captions) can be used out-of-the-box to ground language to the visual observations of a navigating agent, without additional data collection or model fine-tuning. These models enable mobile robots to handle new instructions that specify unseen object goals and can be combined with exploration algorithms to search for the first instance of any object (CoW) [12] or traverse object-centric landmarks in graphs (LM-Nav) [13]. While promising, these methods predominantly use VLMs as critics to match image observations to object goal descriptions, but do so in ways that remain disjoint from the mapping of the environment.…”

Section: Introduction (mentioning)

confidence: 99%

“…Extensive experiments show that using VLMaps enables more effective long-horizon multi-object goal navigation than baseline alternatives, e.g., CoW [12] and LM-Nav [13], and, in particular, excels at enabling spatial open-vocabulary navigation tasks. We also provide ablations on different ways of constructing VLMaps with different language models as well as a discussion on limitations, which point to areas for future work.…”

Section: Introduction (mentioning)

confidence: 99%

Visual Language Maps for Robot Navigation

Huang, Mees, Zeng, et al. 2022

Preprint

Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions). While this is useful for matching images to natural language descriptions of object goals, it remains disjoint from the process of mapping the environment, so that it lacks the spatial precision of classic geometric maps. To address this problem, we propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world. VLMaps can be autonomously built from video feed on robots using standard exploration approaches and enables natural language indexing of the map without additional labeled data. Specifically, when combined with large language models (LLMs), VLMaps can be used to (i) translate natural language commands into a sequence of open-vocabulary navigation goals (which, beyond prior work, can be spatial by construction, e.g., "in between the sofa and the TV" or "three meters to the right of the chair") directly localized in the map, and (ii) can be shared among multiple robots with different embodiments to generate new obstacle maps on-the-fly (by using a list of obstacle categories). Extensive experiments carried out in simulated and real-world environments show that VLMaps enable navigation according to more complex language instructions than existing methods. Videos are available at https://vlmaps.github.io.
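To make the idea of "natural language indexing of the map" in the abstract above more concrete, here is a minimal toy sketch. It is not the VLMaps implementation: the three-dimensional embeddings, the `TEXT_EMBEDDINGS` table, the grid layout, and every function name are hypothetical stand-ins for real visual-language features. It only illustrates matching per-cell features against text embeddings by cosine similarity, deriving an obstacle map from a list of obstacle categories, and resolving a spatial goal such as "in between the sofa and the TV".

```python
import math

# Hypothetical text embeddings; a real system would obtain these from a
# pretrained visual-language model instead of a hand-written table.
TEXT_EMBEDDINGS = {
    "sofa": (1.0, 0.0, 0.0),
    "tv": (0.0, 1.0, 0.0),
    "floor": (0.0, 0.0, 1.0),
}

# Hypothetical 1x5 grid map; each cell stores a fused visual feature vector.
FEATURE_MAP = [[
    (0.9, 0.1, 0.2),
    (0.1, 0.0, 0.9),
    (0.0, 0.1, 1.0),
    (0.1, 0.1, 0.8),
    (0.2, 0.9, 0.1),
]]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_cells(feature_map, categories):
    """Give each cell the category whose text embedding is most similar."""
    return [
        [max(categories, key=lambda c: cosine(cell, TEXT_EMBEDDINGS[c]))
         for cell in row]
        for row in feature_map
    ]


def centroid(labels, category):
    """Mean (row, col) of all cells labeled with the given category."""
    cells = [(r, c) for r, row in enumerate(labels)
             for c, lab in enumerate(row) if lab == category]
    if not cells:
        raise ValueError(f"no cell labeled {category!r}")
    n = len(cells)
    return (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)


def in_between(labels, first, second):
    """Spatial goal: midpoint between the centroids of two landmarks."""
    (r1, c1), (r2, c2) = centroid(labels, first), centroid(labels, second)
    return ((r1 + r2) / 2, (c1 + c2) / 2)


def obstacle_map(labels, obstacle_categories):
    """Boolean grid: True where the cell label is in the obstacle list."""
    return [[lab in obstacle_categories for lab in row] for row in labels]


labels = label_cells(FEATURE_MAP, ["sofa", "tv", "floor"])
goal = in_between(labels, "sofa", "tv")
blocked = obstacle_map(labels, {"sofa", "tv"})
```

Because the obstacle list is only an argument to `obstacle_map`, the same labeled map can yield different obstacle maps for different robots, which mirrors point (ii) of the abstract under these toy assumptions.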

Application of Pretrained Large Language Models in Embodied Artificial Intelligence

Kovalev, Panov 2022

Dokl. Math.

A feature of tasks in embodied artificial intelligence is that a query to an intelligent agent is formulated in natural language. As a result, natural language processing methods have to be used to transform the query into a format convenient for generating an appropriate action plan. There are two basic approaches to the solution of this problem. One is based on specialized models trained with particular instances of instructions translated into agent-executable format. The other approach relies on the ability of large language models trained with a large amount of unlabeled data to store common sense knowledge. As a result, such models can be used to generate an agent’s action plan in natural language without preliminary learning. This paper provides a detailed review of models based on the second approach as applied to embodied artificial intelligence tasks.

Demonstrating Large Language Models on Robots

Google DeepMind

2023

Robotics: Science and Systems XIX
