2022
DOI: 10.48550/arxiv.2203.10421
Preprint

CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

Cited by 2 publications (8 citation statements) | References: 0 publications
“…Zero-shot Models. The recent success of large pretrained vision and language models [10], [27] has spurred a flurry of interest in applying their zero-shot capabilities to different domains, including object detection and segmentation [28], [29], [11], robot manipulation [30], [31], [32], [33], and navigation [13], [12], [34]. Most related to our work is the approach denoted LM-Nav [13], which combines three pre-trained models to navigate via a topological graph in the real world.…”
Section: Related Work
confidence: 99%
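For intuition, here is a minimal sketch of the topological-graph navigation scheme this statement attributes to LM-Nav [13], not the authors' code. Landmark phrases are assumed to be pre-extracted from the instruction (LM-Nav uses a large language model for that step), and the `similarity` callable and node `image` attribute are hypothetical stand-ins for any image-text scorer such as CLIP cosine similarity.

```python
# Minimal sketch, not the LM-Nav implementation: ground landmark phrases to
# nodes of a topological graph and chain shortest paths between them.
import networkx as nx

def ground_landmark(graph, phrase, similarity):
    """Pick the node whose stored observation image best matches the phrase."""
    return max(graph.nodes,
               key=lambda n: similarity(graph.nodes[n]["image"], phrase))

def plan_route(graph, start, landmark_phrases, similarity):
    """Visit each grounded landmark in instruction order via shortest paths."""
    route, current = [start], start
    for phrase in landmark_phrases:
        goal = ground_landmark(graph, phrase, similarity)
        route += nx.shortest_path(graph, current, goal)[1:]
        current = goal
    return route
```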
“…Most related to our work is the approach denoted LM-Nav [13], which combines three pre-trained models to navigate via a topological graph in the real world. CoW [12] performs zero-shot language-based object navigation by combining CLIP-based [10] saliency maps and traditional exploration methods. However, both LM-Nav [13] and CoW [12] are limited to navigating to object landmarks and are less capable of understanding finer-grained queries, such as "to the left of the chair" and "in between the TV and the sofa".…”
Section: Related Work
confidence: 99%
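The following is a minimal sketch of the CLIP-based grounding idea this statement attributes to CoW [12], not the CoW implementation: it scores image crops against the language query with the open-source CLIP package (pip install git+https://github.com/openai/CLIP.git). CoW itself derives gradient-based saliency from CLIP and pairs it with a classical exploration policy, so this crop-scoring variant is only an illustrative approximation.

```python
# Minimal sketch, not the CoW method: CLIP similarity between a text query
# and a grid of image crops, as a stand-in for a saliency map.
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

def patch_scores(frame: Image.Image, query: str, grid: int = 4):
    """Return ((row, col), similarity) for each cell of a grid of crops.
    A high-scoring cell suggests the queried object is in view; otherwise
    the agent would fall back to an exploration policy."""
    tokens = clip.tokenize([query])
    with torch.no_grad():
        t = model.encode_text(tokens)
        t = t / t.norm(dim=-1, keepdim=True)
        w, h = frame.size
        scores = []
        for row in range(grid):
            for col in range(grid):
                crop = frame.crop((col * w // grid, row * h // grid,
                                   (col + 1) * w // grid, (row + 1) * h // grid))
                v = model.encode_image(preprocess(crop).unsqueeze(0))
                v = v / v.norm(dim=-1, keepdim=True)
                scores.append(((row, col), float(v @ t.T)))
    return scores
```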