2022
DOI: 10.48550/arxiv.2206.12403
Preprint
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Abstract: We present a scalable approach for learning open-world object-goal navigation (ObjectNav): the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot, i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images…
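
The abstract's key mechanism is encoding both goal images (used for ImageNav training) and object names (used for zero-shot ObjectNav evaluation) into one multimodal semantic embedding space. A minimal sketch of that goal-encoding step, assuming a CLIP model via the Hugging Face transformers library (the checkpoint and helper names are illustrative, not the authors' code):

```python
# Minimal sketch: encode image goals and text goals into one shared embedding
# space (CLIP), as described in the ZSON abstract. Assumes the Hugging Face
# "transformers" library; checkpoint choice is illustrative, not the paper's.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_image_goal(image: Image.Image) -> torch.Tensor:
    """Embed a goal image (the training-time ImageNav goal)."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def encode_object_goal(object_name: str) -> torch.Tensor:
    """Embed an object category as text (the zero-shot ObjectNav goal)."""
    inputs = processor(text=[f"a photo of a {object_name}"], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Because both goal types live in the same space, a policy trained on
# image-goal embeddings can be handed a text-goal embedding at test time.
goal = encode_object_goal("sink")
print(goal.shape)  # e.g., torch.Size([1, 512])
```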

Cited by 4 publications (5 citation statements)
References 21 publications
“…Shah et al (Shah et al 2023) employs GPT-3 (Brown et al 2020) in an attempt to identify "landmarks" or subgoals, while Huang et al (Huang et al 2022) concentrates its efforts on the application of an LLM for the generation of code. Zhou et al (Zhou et al 2023) use LLM to extract the commonsense knowledge of the relations between targets and objects in observations to perform zero-shot object navigation (ZSON) (Gadre et al 2022;Majumdar et al 2022). Despite these recent advancements, our study diverges in its concentration on converting visual scene semantics into input prompts for the LLM, directly performing VLN based on the commonsense knowledge and reasoning ability of LLMs.…”
Section: LLMs in Robotics Navigation
Mentioning confidence: 99%
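
The approach this quotation describes (turning visual scene semantics into input prompts for an LLM to drive zero-shot navigation) amounts to prompt construction plus a reasoning query. A rough sketch follows; `query_llm` and the prompt wording are hypothetical placeholders, not any cited paper's actual interface:

```python
# Hedged sketch of "scene semantics -> LLM prompt" reasoning for zero-shot
# object navigation, as the quotation describes. `query_llm` is a hypothetical
# stand-in for whatever LLM client such a system would actually use.
from typing import Callable

def build_navigation_prompt(goal: str, detected: dict[str, list[str]]) -> str:
    """Turn per-direction object detections into a natural-language prompt."""
    lines = [f"You are guiding a robot that must find a {goal}."]
    for direction, objects in detected.items():
        lines.append(f"Looking {direction}, the robot sees: {', '.join(objects)}.")
    lines.append("Which direction should the robot explore next? Answer with one word.")
    return "\n".join(lines)

def choose_direction(goal: str,
                     detected: dict[str, list[str]],
                     query_llm: Callable[[str], str]) -> str:
    """Ask the LLM to pick a direction from commonsense goal-object relations."""
    return query_llm(build_navigation_prompt(goal, detected)).strip().lower()

# Example usage with a dummy LLM that always answers "left".
print(choose_direction(
    "sink",
    {"left": ["refrigerator", "countertop"], "right": ["sofa", "television"]},
    query_llm=lambda prompt: "left",
))
```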
“…Note that the latter depends on making correct associations between landmarks and observations, which is known as the correspondence problem in SLAM, and is typically implicit in the notation used to describe algorithms. A single robot state x_t is generally a pose: a concatenation of position (translation) t and orientation (rotation) R. Here t may belong to R^2 or R^3 for position in 2D or 3D space, respectively, while the corresponding R is typically a matrix in SO(2) or SO(3), the special orthogonal groups consisting of orthogonal 2 × 2 and 3 × 3 matrices with determinant 1. The Cartesian product R^n × SO(n) then constitutes the n-dimensional special Euclidean group SE(n), with composition:…”
Section: Full SLAM
Mentioning confidence: 99%
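
The composition rule truncated at the end of this quotation is, in the standard formulation, the semidirect-product operation (R1, t1) ∘ (R2, t2) = (R1 R2, R1 t2 + t1). A minimal numpy sketch of that operation (my illustration, not the cited paper's code), specialized here to SE(2):

```python
# Minimal sketch of the standard SE(n) composition rule the quote truncates:
# (R1, t1) ∘ (R2, t2) = (R1 @ R2, R1 @ t2 + t1).
# Illustration only, not the cited paper's code.
import numpy as np

def compose(pose_a, pose_b):
    """Compose two poses (R, t) in SE(n)."""
    R1, t1 = pose_a
    R2, t2 = pose_b
    return R1 @ R2, R1 @ t2 + t1

def se2(theta, x, y):
    """Build an SE(2) pose from a heading angle and a 2D translation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]), np.array([x, y])

# Example: move 1 m forward, turn 90 degrees, then move 1 m forward again.
pose = compose(se2(np.pi / 2, 1.0, 0.0), se2(0.0, 1.0, 0.0))
print(np.round(pose[1], 3))  # final position is approximately [1., 1.]
```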
“…The localization of objects in the robot's immediate surroundings once there is performed by learned policies conditioned on sensory observations. It is not hard to imagine that this kind of system could be combined with something like the approaches demonstrated in [2,85]. Given a sequence of navigation instructions in natural language, the former has an LLM to parse out the landmark descriptors, which are compared against sensory observations by a vision-language model (VLM), to produce a topological path through a visual-navigation model (VNM).…”
Section: Robot Navigation Without the Construction of Maps
Mentioning confidence: 99%
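
The pipeline this quotation outlines (an LLM parses landmark phrases from the instruction, a VLM grounds them in observed images, and a navigation model connects the grounded nodes into a path) can be illustrated roughly as below. The `extract_landmarks` stub and the use of CLIP as the VLM are assumptions for illustration, not the cited systems' actual interfaces:

```python
# Hedged sketch: LLM parses landmarks, a vision-language model (CLIP here)
# grounds each landmark phrase in images observed along a topological graph.
# `extract_landmarks` is a hypothetical stand-in for the LLM parsing step.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_landmarks(instruction: str) -> list[str]:
    # Hypothetical LLM call, e.g. "go past the stop sign and stop at the
    # blue mailbox" -> ["stop sign", "blue mailbox"].
    raise NotImplementedError("replace with an actual LLM call")

def ground_landmarks(landmarks: list[str],
                     node_images: dict[int, Image.Image]) -> list[int]:
    """For each landmark phrase, return the graph node whose image matches best."""
    ids = list(node_images)
    inputs = processor(text=landmarks, images=[node_images[i] for i in ids],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_text  # (num_landmarks, num_nodes)
    return [ids[j] for j in logits.argmax(dim=-1).tolist()]
```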
“…Large language models have been explored as an approach to high-level planning [14]- [18] and scene understanding [19], [20]. Vision-language models embedding image features into the same space as text have been applied to open vocabulary object detection [16], [17], natural language maps [15], [17], [21]- [23], and for language-informed navigation [24]- [26].…”
Section: Language Models in Robotics
Mentioning confidence: 99%