2016
DOI: 10.1609/aaai.v30i1.10364

Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences

Abstract: We propose a neural sequence-to-sequence model for direction following, a task that is essential to realizing effective autonomous agents. Our alignment-based encoder-decoder model with long short-term memory recurrent neural networks (LSTM-RNN) translates natural language instructions to action sequences based upon a representation of the observable world state. We introduce a multi-level aligner that empowers our model to focus on sentence "regions" salient to the current world state by using multiple abstractions of the input sentence.
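The multi-level aligner described in the abstract can be illustrated with a small sketch. Below is a minimal NumPy mock-up, assuming a bilinear scoring function and illustrative dimensions; the names (multi_level_context, W) are hypothetical and this is not the authors' released code:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_level_context(word_embs, enc_hiddens, dec_state, W):
    """Attend over instruction words using multiple abstraction levels.

    word_embs:   (T, d_e) raw word embeddings (low-level abstraction)
    enc_hiddens: (T, d_h) encoder LSTM states (high-level abstraction)
    dec_state:   (d_s,)   current decoder state
    W:           (d_e + d_h, d_s) alignment matrix (assumed bilinear scoring)
    """
    # Multi-level token representation: concatenate both abstractions.
    r = np.concatenate([word_embs, enc_hiddens], axis=1)  # (T, d_e + d_h)
    scores = r @ W @ dec_state                            # (T,) alignment scores
    alpha = softmax(scores)                               # weights over words
    return alpha @ r                                      # context vector

# Toy usage with illustrative dimensions: a 5-word instruction.
T, d_e, d_h, d_s = 5, 8, 16, 16
rng = np.random.default_rng(0)
ctx = multi_level_context(rng.normal(size=(T, d_e)),
                          rng.normal(size=(T, d_h)),
                          rng.normal(size=(d_s,)),
                          rng.normal(size=(d_e + d_h, d_s)))
print(ctx.shape)  # (24,)
```

The point the abstract highlights is that attention scores and the returned context are computed over both the raw word embeddings and the encoder states, rather than over the LSTM states alone.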

Cited by 79 publications (42 citation statements)
References 18 publications

“…Vision & Language Navigation (VLN) tasks agents with taking in language instructions and a visual observation to produce an action, such as turning or moving forward, to receive a new visual observation. VLN benchmarks have evolved from the use of symbolic environment representations (MacMahon, Stankiewicz, and Kuipers 2006;Chen and Mooney 2011;Mei, Bansal, and Walter 2016) to photorealistic indoor (Anderson et al 2018) and outdoor environments (Chen et al 2019), as well as the prediction of continuous control (Blukis et al 2018). TEACh goes beyond navigation to object interactions for task completion, and beyond single instructions to dialogue.…”
Section: Related Workmentioning
confidence: 99%
“…Vision & Language Navigation (VLN) tasks agents with taking in language instructions and a visual observation to produce an action, such as turning or moving forward, to receive a new visual observation. VLN benchmarks have evolved from the use of symbolic environment representations (MacMahon, Stankiewicz, and Kuipers 2006;Chen and Mooney 2011;Mei, Bansal, and Walter 2016) to photorealistic indoor (Anderson et al 2018) and outdoor environments (Chen et al 2019), as well as the prediction of continuous control (Blukis et al 2018). TEACh goes beyond navigation to object interactions for task completion, and beyond single instructions to dialogue.…”
Section: Related Workmentioning
confidence: 99%
“…An attention mechanism (Fig. 1c) has proven to be particularly effective for various related tasks in machine translation, image caption synthesis, and language understanding (Mnih et al. 2014; Bahdanau, Cho, and Bengio 2015; Xu et al. 2015; Mei, Bansal, and Walter 2016a).…”
Section: Attention in RNN-Seq2Seq Models
confidence: 99%
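For reference, the additive ("Bahdanau-style") attention cited in this statement scores each encoder state against the current decoder state. A minimal NumPy sketch, with illustrative shapes; W_h, W_q, and v are assumed learned parameters:

```python
import numpy as np

def additive_attention(hiddens, query, W_h, W_q, v):
    """Additive attention over encoder states (illustrative sketch).

    hiddens: (T, d_h) encoder token representations
    query:   (d_q,)   current decoder state
    W_h:     (d_h, d_a), W_q: (d_q, d_a), v: (d_a,) learned parameters
    """
    scores = np.tanh(hiddens @ W_h + query @ W_q) @ v  # (T,) alignment scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                               # softmax-normalized weights
    return alpha @ hiddens                             # weighted context vector
```

The resulting context vector is fed to the decoder at each step, letting it focus on different input tokens over time.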
“…The original attention model introduced by Bahdanau, Cho, and Bengio (2015) uses the hidden units $h_{0:t-1}$ as the token representations $r_{0:t-1}$. Recent work (Mei, Bansal, and Walter 2016a) has demonstrated that performance can be improved by using multiple abstractions of the input, e.g., $r_i = (E_{w_i}, h_i)$, which is what we use in this work.…”
Section: Attention in RNN-LM
confidence: 99%
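The snippet's multi-level representation $r_i = (E_{w_i}, h_i)$ amounts to concatenating each word's embedding with the hidden state at the same position, mirroring the aligner sketched after the abstract above. A minimal sketch of that construction, with assumed shapes:

```python
import numpy as np

def multi_level_tokens(E, word_ids, hiddens):
    """Build r_i = (E_{w_i}, h_i): pair each word's embedding with the
    RNN hidden state at the same position.

    E: (V, d_e) embedding table; word_ids: (T,) int indices; hiddens: (T, d_h)
    Returns (T, d_e + d_h) token representations for attention to score.
    """
    return np.concatenate([E[word_ids], hiddens], axis=1)
```

Attending over these concatenated representations lets the alignment weights exploit both surface (lexical) and contextual information.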
“…Languages, be they natural or formal, afford these desirable properties [Gopnik and Meltzoff, 1987]. Based on this insight, many papers have tried to leverage the abilities of language in RL to enable communication and improve generalisation and sample efficiency [Andreas et al., 2017, Mei et al., 2016, Goyal et al., 2019, Xu et al., 2022]. The domain can be subdivided into language-conditioned RL (LC-RL), in which language conditions the formulation of the problem [Anderson et al., 2018, Goyal et al., 2019], and language-assisted RL, where language helps the agent to learn [Hu et al., 2019, Colas et al., 2020, Akakzia et al., 2020, Colas et al., 2022].…”
Section: Introduction
confidence: 99%