2022
DOI: 10.48550/arxiv.2210.05714
Preprint

Visual Language Maps for Robot Navigation

Abstract: Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions). While this is useful for matching images to natural language descriptions of object goals, it remains disjoint from the process of mapping the environment, so it lacks the spatial precision of classic geometric maps. To address this problem, we propose VLMaps, a spatial map representation that directly fuses pretrained visu…
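The abstract's core idea — fusing pretrained visual-language features directly into a spatial map, then querying that map with language — can be illustrated with a toy sketch. This is not the authors' implementation: the grid size, feature dimension, averaging scheme, and function names below are all hypothetical, standing in for the real pixel-aligned embeddings and projection pipeline.

```python
import numpy as np

def build_vlmap(points_xy, pixel_features, grid_size=32, feat_dim=4):
    """Fuse per-point visual-language features into a 2D grid by
    averaging the features of all points that fall into each cell."""
    vlmap = np.zeros((grid_size, grid_size, feat_dim))
    counts = np.zeros((grid_size, grid_size))
    for (x, y), feat in zip(points_xy, pixel_features):
        i, j = int(x), int(y)  # assume points are already in grid coordinates
        vlmap[i, j] += feat
        counts[i, j] += 1
    nonzero = counts > 0
    vlmap[nonzero] /= counts[nonzero][:, None]  # mean feature per occupied cell
    return vlmap

def query_map(vlmap, text_embedding):
    """Score every map cell against a text embedding via cosine similarity,
    so the best-matching cell localizes the described object."""
    norms = np.linalg.norm(vlmap, axis=-1) * np.linalg.norm(text_embedding)
    scores = vlmap @ text_embedding
    return np.divide(scores, norms, out=np.zeros_like(scores), where=norms > 0)
```

A query then reduces to embedding a phrase like "the chair" with the same visual-language model and taking the argmax of the similarity map, which is what gives the representation the spatial precision a caption-only matcher lacks.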


Cited by 7 publications (10 citation statements)
References 37 publications
“…Correspondingly, the encoder-decoder network needs to be modified such that it is suitable for dealing with these modified inputs and outputs (see details in Supplementary Note 1). We represent the subjects with 19-point 3D skeletons and model the indoor environment with a 3D visual-semantic map such that a given coordinate (e.g., (1.3 m, 1.2 m, 0.8 m)) can be associated with a semantic coordinate (e.g., "at the left side of the chair") 58 (see Fig. 6a and Supplementary Note 10).…”
Section: Experimental Results for 4D Compressive Microwave Metaimaging
confidence: 99%
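The statement above associates a metric coordinate such as (1.3 m, 1.2 m, 0.8 m) with a semantic coordinate such as "at the left side of the chair". A minimal sketch of that association is a nearest-landmark lookup; the landmark table, the left/right rule, and the function name here are invented for illustration and are not taken from the cited work.

```python
import math

# Hypothetical object landmarks: name -> (x, y, z) position in metres.
LANDMARKS = {"chair": (1.5, 1.2, 0.5), "table": (3.0, 0.8, 0.7)}

def semantic_coordinate(x, y, z, landmarks=LANDMARKS):
    """Map a metric coordinate to a coarse semantic description by
    finding the nearest landmark and its relative direction."""
    name, (ox, oy, oz) = min(
        landmarks.items(),
        key=lambda kv: math.dist((x, y, z), kv[1]),
    )
    side = "left" if x < ox else "right"
    return f"at the {side} side of the {name}"

print(semantic_coordinate(1.3, 1.2, 0.8))  # at the left side of the chair
```

A real visual-semantic map would derive the landmark positions from perception rather than a hand-written table, but the lookup step has this shape.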
“…A handful of contemporary studies, however, have begun to explore the use of generative models for navigation. Shah et al (Shah et al 2023) employ GPT-3 (Brown et al 2020) in an attempt to identify "landmarks" or subgoals, while Huang et al (Huang et al 2022) concentrate on using an LLM to generate code. Zhou et al (Zhou et al 2023) use an LLM to extract commonsense knowledge of the relations between targets and objects in observations to perform zero-shot object navigation (ZSON) (Gadre et al 2022; Majumdar et al 2022).…”
Section: LLMs in Robotics Navigation
confidence: 99%
“…Large language models have been explored as an approach to high-level planning [14]- [18] and scene understanding [19], [20]. Vision-language models embedding image features into the same space as text have been applied to open vocabulary object detection [16], [17], natural language maps [15], [17], [21]- [23], and for language-informed navigation [24]- [26].…”
Section: Language Models in Robotics
confidence: 99%
“…Such methods decompose the scene into a graph where edges model relations between parts of the scene. The geometry of the parts is typically represented as a signed distance function stored in a voxel grid [15].…”
Section: Semantic Scene Representations
confidence: 99%
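The signed-distance-function-in-a-voxel-grid representation mentioned in the last statement can be sketched in a few lines, assuming a single sphere as the geometry; the grid resolution, cell size, and function name are illustrative only. Each voxel stores the signed distance from its centre to the nearest surface: negative inside the object, positive outside, with the zero level set tracing the surface itself.

```python
import numpy as np

def sphere_sdf_grid(center, radius, grid_size=8, cell=1.0):
    """Store a signed distance function in a voxel grid: each cell
    holds the signed distance from its centre to a sphere's surface
    (negative inside, positive outside)."""
    coords = (np.arange(grid_size) + 0.5) * cell  # voxel-centre coordinates
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    dist = np.sqrt((x - center[0])**2 + (y - center[1])**2 + (z - center[2])**2)
    return dist - radius

sdf = sphere_sdf_grid(center=(4.0, 4.0, 4.0), radius=2.0)
# Voxels with sdf < 0 lie inside the sphere; the zero level set is the surface.
```

Scene-graph methods of the kind cited attach one such grid (or a fused one) to each part of the scene, so geometric queries like collision checks reduce to trilinear lookups into the grid.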