2022
DOI: 10.48550/arxiv.2207.04429
Preprint

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

Abstract: Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing for good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient modality for communication with robots, but contemporary methods typically require expensive supervision, in the form of trajectories annotated with language descriptions. We present a system, LM-Nav…

Cited by 7 publications (16 citation statements) · References 39 publications

“…The recent success of large pretrained vision and language models [10], [27] has spurred a flurry of interest in applying their zero-shot capabilities to different domains including object detection and segmentation [28], [29], [11], robot manipulation [30], [31], [32], [33], and navigation [13], [12], [34]. Most related to our work is the approach denoted LM-Nav [13], which combines three pre-trained models to navigate via a topological graph in the real world. CoW [12] performs zero-shot language-based object navigation by combining CLIP-based [10] saliency maps and traditional exploration methods.…”
Section: Related Work (mentioning)
confidence: 99%
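
The statement above summarizes the decomposition in LM-Nav [13]: a language model parses the instruction into landmarks, a vision-language model grounds those landmarks in the nodes of a topological graph, and a planner searches the graph for a route that visits them in order. As a rough illustration of that idea, here is a minimal, runnable Python sketch. The function names, the keyword-match scoring (standing in for real LLM and CLIP calls), and the movement penalty are all hypothetical stand-ins, not the authors' implementation.

```python
import heapq
import math

def extract_landmarks(instruction):
    # Stand-in for the LLM step: a real system would prompt a large
    # language model to parse the instruction into ordered landmarks.
    return ["stop sign", "blue house"]

def score_node(landmark, node_caption):
    # Stand-in for VLM image-text similarity (e.g., CLIP): here, a toy
    # match between the landmark text and what was seen at the node.
    return 1.0 if landmark in node_caption else 0.1

def best_route(graph, captions, start, landmarks):
    """Search over (node, landmarks-grounded-so-far) states, minimizing
    the negative log-probability of grounding each landmark in order."""
    frontier = [(0.0, start, 0, [start])]
    seen = {}
    while frontier:
        cost, node, k, path = heapq.heappop(frontier)
        if k == len(landmarks):
            return path  # first goal state popped is the cheapest route
        if seen.get((node, k), float("inf")) <= cost:
            continue
        seen[(node, k)] = cost
        for nxt in graph[node]:
            s = score_node(landmarks[k], captions[nxt])
            # Either treat nxt as grounding the next landmark...
            heapq.heappush(frontier, (cost - math.log(s), nxt, k + 1, path + [nxt]))
            # ...or just traverse it, paying a small movement penalty.
            heapq.heappush(frontier, (cost + 0.01, nxt, k, path + [nxt]))
    return None

graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
captions = {0: "driveway", 1: "stop sign", 2: "park", 3: "blue house"}
route = best_route(graph, captions, 0,
                   extract_landmarks("go past the stop sign to the blue house"))
print(route)  # [0, 1, 2, 3]
```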
“…CoW [12] performs zero-shot language-based object navigation by combining CLIP-based [10] saliency maps and traditional exploration methods. However, both LM-Nav [13] and CoW [12] are limited to navigating to object landmarks and are less capable of understanding finer-grained queries, such as "to the left of the chair" and "in between the TV and the sofa". In contrast, our method enables spatial language indexing beyond object-centric goals and can generate open-vocabulary obstacle maps.…”
Section: Related Work (mentioning)
confidence: 99%
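
For the CoW-style combination of CLIP-based saliency with classical exploration mentioned above, a minimal sketch of the gating logic might look as follows. Only the CLIP calls follow the real openai/CLIP package API; the threshold value and the explorer/goal_driver policies are hypothetical placeholders, and plain image-text similarity is used here in place of the paper's actual saliency-map computation.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def goal_score(view: Image.Image, query: str) -> float:
    """Cosine similarity between CLIP embeddings of the current
    egocentric view and the language goal."""
    with torch.no_grad():
        img = preprocess(view).unsqueeze(0).to(device)
        txt = clip.tokenize([query]).to(device)
        img_f = model.encode_image(img)
        txt_f = model.encode_text(txt)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        return (img_f @ txt_f.T).item()

THRESHOLD = 0.3  # placeholder; the paper tunes its own gating rule

def step(view: Image.Image, query: str, explorer, goal_driver):
    # explorer / goal_driver are hypothetical policies: frontier-style
    # exploration vs. moving toward the region that matches the query.
    if goal_score(view, query) > THRESHOLD:
        return goal_driver(view)
    return explorer(view)
```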