Translating Navigation Instructions in Natural Language to a High-Level Plan for Behavioral Robot Navigation

Zang, Xiaoxue; Pokle, Ashwini; Vázquez, Marynel; Chen, Kevin; Niebles, Juan Carlos; Soto, Álvaro; Savarese, Silvio

doi:10.18653/v1/d18-1286

Cited by 23 publications

(18 citation statements)

References 26 publications


“…Shah et al [15] utilized attention over linguistic instructions conditioned on the multi-modal sensory observations to focus on the relevant parts of the command during navigation task. [16] approach the language-based navigation task as a sequence prediction problem. They translate navigation instructions into a sequence of behaviours that a robot can execute to reach the desired destination.…”

Section: Language Based Navigation (mentioning)

confidence: 99%

Grounding Linguistic Commands to Navigable Regions

Rufus, Jain, Nair et al. 2021

Preprint


Humans have a natural ability to effortlessly comprehend linguistic commands such as "park next to the yellow sedan" and instinctively know which region of the road the vehicle should navigate. Extending this ability to autonomous vehicles is the next step towards creating fully autonomous agents that respond and act according to human commands. To this end, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on the linguistic command. RNR is different from Referring Image Segmentation (RIS), which focuses on grounding an object referred to by the natural language expression instead of grounding a navigable region. For example, for a command "park next to the yellow sedan," RIS will aim to segment the referred sedan, and RNR aims to segment the suggested parking region on the road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2car [1] dataset with segmentation masks for the regions described by the linguistic commands. A separate test split with concise manoeuvre-oriented commands is provided to assess the practicality of our dataset. We benchmark the proposed dataset using a novel transformer-based architecture. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.



“…In the literature, a few datasets have been created for similar tasks (Chen et al 2019; de Vries et al 2018; Zang et al 2018). For instance, annotated the language description for the route by asking the user to navigate the entire path in egocentric perspective.…”

Section: Annotation and Dataset Statistics (mentioning)

confidence: 99%

“…For instance, annotated the language description for the route by asking the user to navigate the entire path in egocentric perspective. Incorporation of overhead map of navigated route as an aid for describing the route can be seen in Chen et al (2019); de Vries et al (2018); Zang et al (2018).…”

Section: Annotation and Dataset Statistics (mentioning)

confidence: 99%

Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory

2020


The role of robots in society keeps expanding, bringing with it the necessity of interacting and communicating with humans. In order to keep such interaction intuitive, we provide automatic wayfinding based on verbal navigational instructions. Our first contribution is the creation of a large-scale dataset with verbal navigation instructions. To this end, we have developed an interactive visual navigation environment based on Google Street View; we further design an annotation method to highlight mined anchor landmarks and local directions between them in order to help annotators formulate typical, human references to those. The annotation task was crowdsourced on the AMT platform to construct a new Talk2Nav dataset with 10,714 routes. Our second contribution is a new learning method. Inspired by spatial cognition research on the mental conceptualization of navigational instructions, we introduce a soft dual attention mechanism defined over the segmented language instructions to jointly extract two partial instructions: one for matching the next upcoming visual landmark and the other for matching the local directions to the next landmark. Along similar lines, we also introduce a spatial memory scheme to encode the local directional transitions. Our work takes advantage of advances in two lines of research: mental formalization of verbal navigational instructions and training neural network agents for automatic wayfinding. Extensive experiments show that our method significantly outperforms previous navigation methods. For demo video, dataset and code, please refer to our project page.


“…Most strategies are based on imitation learning, relying on expert demonstrations and knowledge from the environment. For example, [42] relate instructions to an environment graph, requiring both demonstrations and high-level navigation information. Closer to our work, [15] also learns a navigation model and an instruction generator, but the latter is used to generate additional training data for the agent.…”

Section: A Vision and Language Navigation (mentioning)

confidence: 99%

“…Besides, instruction following is a notoriously hard RL problem as the training signal is very sparse since the agent is only rewarded over task completion. In practice, the navigation and language grounding problems are often circumvented by warm-starting the policy with labeled trajectories [42,1]. Although scalable, these approaches require numerous human demonstrations, whereas we here want to jointly learn the navigation policy and language understanding from scratch.…”

Section: Introduction (mentioning)

confidence: 99%

HIGhER: Improving instruction following with Hindsight Generation for Experience Replay

Cideron, Seurin, Strub et al. 2020

2020 IEEE Symposium Series on Computational Intelligence (SSCI)


Language creates a compact representation of the world and allows the description of unlimited situations and objectives through compositionality. While these characterizations may foster instructing, conditioning or structuring interactive agent behavior, it remains an open problem to correctly relate language understanding and reinforcement learning in even simple instruction-following scenarios. This joint learning problem is alleviated through expert demonstrations, auxiliary losses, or neural inductive biases. In this paper, we propose an orthogonal approach called Hindsight Generation for Experience Replay (HIGhER) that extends the Hindsight Experience Replay approach to the language-conditioned policy setting. Whenever the agent does not fulfill its instruction, HIGhER learns to output a new directive that matches the agent trajectory, and it relabels the episode with a positive reward. To do so, HIGhER learns to map a state into an instruction by using past successful trajectories, which removes the need to have external expert interventions to relabel episodes as in vanilla HER. We show the efficiency of our approach in the BabyAI environment, and demonstrate how it complements other instruction-following methods.
