2022
DOI: 10.48550/arxiv.2206.12403
Preprint
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Abstract: We present a scalable approach for learning open-world object-goal navigation (ObjectNav): the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot, i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images…
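
The abstract's key mechanism is encoding both goal images (used for ImageNav training) and object names (used for zero-shot ObjectNav evaluation) into one multimodal semantic embedding space. A minimal sketch of that goal-encoding step, assuming a CLIP model via the Hugging Face transformers library (the checkpoint and helper names are illustrative, not the authors' code):

```python
# Minimal sketch: encode image goals and text goals into one shared embedding
# space (CLIP), as described in the ZSON abstract. Assumes the Hugging Face
# "transformers" library; checkpoint choice is illustrative, not the paper's.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_image_goal(image: Image.Image) -> torch.Tensor:
    """Embed a goal image (the training-time ImageNav goal)."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def encode_object_goal(object_name: str) -> torch.Tensor:
    """Embed an object category as text (the zero-shot ObjectNav goal)."""
    inputs = processor(text=[f"a photo of a {object_name}"], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Because both goal types live in the same space, a policy trained on
# image-goal embeddings can be handed a text-goal embedding at test time.
goal = encode_object_goal("sink")
print(goal.shape)  # e.g., torch.Size([1, 512])
```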

Cited by 4 publications (5 citation statements)
References 21 publications
“…Shah et al (Shah et al 2023) employs GPT-3 (Brown et al 2020) in an attempt to identify "landmarks" or subgoals, while Huang et al (Huang et al 2022) concentrates its efforts on the application of an LLM for the generation of code. Zhou et al (Zhou et al 2023) use LLM to extract the commonsense knowledge of the relations between targets and objects in observations to perform zero-shot object navigation (ZSON) (Gadre et al 2022;Majumdar et al 2022). Despite these recent advancements, our study diverges in its concentration on converting visual scene semantics into input prompts for the LLM, directly performing VLN based on the commonsense knowledge and reasoning ability of LLMs.…”
Section: LLMs in Robotics Navigation
Mentioning confidence: 99%
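
The approach this quotation describes (turning visual scene semantics into input prompts for an LLM to drive zero-shot navigation) amounts to prompt construction plus a reasoning query. A rough sketch follows; `query_llm` and the prompt wording are hypothetical placeholders, not any cited paper's actual interface:

```python
# Hedged sketch of "scene semantics -> LLM prompt" reasoning for zero-shot
# object navigation, as the quotation describes. `query_llm` is a hypothetical
# stand-in for whatever LLM client such a system would actually use.
from typing import Callable

def build_navigation_prompt(goal: str, detected: dict[str, list[str]]) -> str:
    """Turn per-direction object detections into a natural-language prompt."""
    lines = [f"You are guiding a robot that must find a {goal}."]
    for direction, objects in detected.items():
        lines.append(f"Looking {direction}, the robot sees: {', '.join(objects)}.")
    lines.append("Which direction should the robot explore next? Answer with one word.")
    return "\n".join(lines)

def choose_direction(goal: str,
                     detected: dict[str, list[str]],
                     query_llm: Callable[[str], str]) -> str:
    """Ask the LLM to pick a direction from commonsense goal-object relations."""
    return query_llm(build_navigation_prompt(goal, detected)).strip().lower()

# Example usage with a dummy LLM that always answers "left".
print(choose_direction(
    "sink",
    {"left": ["refrigerator", "countertop"], "right": ["sofa", "television"]},
    query_llm=lambda prompt: "left",
))
```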
“…Note that the latter depends on making correct associations between landmarks and observations, which is known as the correspondence problem in SLAM, and is typically implicit in the notation used to describe algorithms. A single robot state x_t is generally a pose: a concatenation of position (translation) t and orientation (rotation) R. Here t may belong to R^2 or R^3 for position in 2D or 3D space, respectively, while the corresponding R is typically a matrix in SO(2) or SO(3), the special orthogonal groups consisting of orthogonal 2 × 2 and 3 × 3 matrices with determinant 1. The Cartesian product R^n × SO(n) then constitutes the n-dimensional special Euclidean group SE(n), with composition:…”
Section: Full SLAM
Mentioning confidence: 99%
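
The composition rule truncated at the end of this quotation is, in the standard formulation, the semidirect-product operation (R1, t1) ∘ (R2, t2) = (R1 R2, R1 t2 + t1). A minimal numpy sketch of that operation (my illustration, not the cited paper's code), specialized here to SE(2):

```python
# Minimal sketch of the standard SE(n) composition rule the quote truncates:
# (R1, t1) ∘ (R2, t2) = (R1 @ R2, R1 @ t2 + t1).
# Illustration only, not the cited paper's code.
import numpy as np

def compose(pose_a, pose_b):
    """Compose two poses (R, t) in SE(n)."""
    R1, t1 = pose_a
    R2, t2 = pose_b
    return R1 @ R2, R1 @ t2 + t1

def se2(theta, x, y):
    """Build an SE(2) pose from a heading angle and a 2D translation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]), np.array([x, y])

# Example: move 1 m forward, turn 90 degrees, then move 1 m forward again.
pose = compose(se2(np.pi / 2, 1.0, 0.0), se2(0.0, 1.0, 0.0))
print(np.round(pose[1], 3))  # final position is approximately [1., 1.]
```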
“…The localization of objects in the robot's immediate surroundings once there is performed by learned policies conditioned on sensory observations. It is not hard to imagine that this kind of system could be combined with something like the approaches demonstrated in [2,85]. Given a sequence of navigation instructions in natural language, the former has an LLM to parse out the landmark descriptors, which are compared against sensory observations by a vision-language model (VLM), to produce a topological path through a visual-navigation model (VNM).…”
Section: Robot Navigation Without the Construction of Maps
Mentioning confidence: 99%
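
The pipeline this quotation outlines (an LLM parses landmark phrases from the instruction, a VLM grounds them in observed images, and a navigation model connects the grounded nodes into a path) can be illustrated roughly as below. The `extract_landmarks` stub and the use of CLIP as the VLM are assumptions for illustration, not the cited systems' actual interfaces:

```python
# Hedged sketch: LLM parses landmarks, a vision-language model (CLIP here)
# grounds each landmark phrase in images observed along a topological graph.
# `extract_landmarks` is a hypothetical stand-in for the LLM parsing step.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_landmarks(instruction: str) -> list[str]:
    # Hypothetical LLM call, e.g. "go past the stop sign and stop at the
    # blue mailbox" -> ["stop sign", "blue mailbox"].
    raise NotImplementedError("replace with an actual LLM call")

def ground_landmarks(landmarks: list[str],
                     node_images: dict[int, Image.Image]) -> list[int]:
    """For each landmark phrase, return the graph node whose image matches best."""
    ids = list(node_images)
    inputs = processor(text=landmarks, images=[node_images[i] for i in ids],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_text  # (num_landmarks, num_nodes)
    return [ids[j] for j in logits.argmax(dim=-1).tolist()]
```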
“…Large language models have been explored as an approach to high-level planning [14]- [18] and scene understanding [19], [20]. Vision-language models embedding image features into the same space as text have been applied to open vocabulary object detection [16], [17], natural language maps [15], [17], [21]- [23], and for language-informed navigation [24]- [26].…”
Section: Language Models in Robotics
Mentioning confidence: 99%