Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1287
Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction

Abstract: We propose to decompose instruction execution to goal prediction and action generation. We design a model that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them. Our model is trained from demonstration only without external resources. To evaluate our approach, we introduce two benchmarks for instruction following: LANI, a navigation task; and CHAI, where an agent executes household instructions. Our ev…
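The abstract describes the core idea: a U-Net-style image network whose filters are conditioned on the instruction, producing a spatial goal prediction over the raw observation. Below is a minimal, illustrative sketch of such a language-conditioned U-Net in PyTorch. The layer sizes, the number of scales, and the way the instruction embedding is turned into per-example 1x1 convolution kernels are assumptions for illustration, not the authors' exact LINGUNET architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LingUNetSketch(nn.Module):
    """Toy language-conditioned U-Net: image in, goal heatmap out."""

    def __init__(self, in_ch=3, hid=16, text_dim=32):
        super().__init__()
        # Downsampling branch over the visual observation.
        self.down1 = nn.Conv2d(in_ch, hid, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(hid, hid, 3, stride=2, padding=1)
        # Map the instruction embedding to 1x1 conv kernels that filter
        # each skip connection -- the language conditioning step.
        self.to_kernel1 = nn.Linear(text_dim, hid * hid)
        self.to_kernel2 = nn.Linear(text_dim, hid * hid)
        # Upsampling branch producing a single-channel goal map.
        self.up1 = nn.ConvTranspose2d(hid, hid, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(hid, 1, 4, stride=2, padding=1)

    def forward(self, image, text_emb):
        f1 = F.relu(self.down1(image))   # B x hid x H/2 x W/2
        f2 = F.relu(self.down2(f1))      # B x hid x H/4 x W/4
        b = image.size(0)
        # Per-example 1x1 kernels derived from the instruction embedding.
        k1 = self.to_kernel1(text_emb).view(b, -1, f1.size(1), 1, 1)
        k2 = self.to_kernel2(text_emb).view(b, -1, f2.size(1), 1, 1)
        # Filter each example's features with its own text-derived kernels.
        g2 = torch.stack([F.conv2d(f2[i:i + 1], k2[i])
                          for i in range(b)]).squeeze(1)
        u1 = F.relu(self.up1(g2))
        g1 = torch.stack([F.conv2d(f1[i:i + 1], k1[i])
                          for i in range(b)]).squeeze(1)
        # Goal-location logits at the input resolution.
        return self.up2(u1 + g1)

img = torch.randn(2, 3, 32, 32)    # batch of raw observations
txt = torch.randn(2, 32)           # stand-in instruction embeddings
out = LingUNetSketch()(img, txt)
print(out.shape)                   # torch.Size([2, 1, 32, 32])
```

A softmax over the output's spatial positions would then give a goal distribution, from which the action-generation stage plans a path; that second stage is omitted here.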

Cited by 125 publications (133 citation statements)
References 30 publications
“…when being integrated into a new environment). Our work is most closely related to recent advances in instruction following and visual attention [2], [3], but we do not provide explicit supervision for object detections or classifications. Finally, we will make the assumption that goals are specified by a simple list of object IDs, so as to avoid the ambiguity introduced by natural language commands.…”
Section: … (mentioning)
confidence: 99%
“…32] and grounded question answering [3,13,34]. Recently, the problem has been studied in interactive simulated environments where the visual input changes as the agent acts, such as interactive question answering [9, 12, ] and instruction following [25,26]. In contrast, we focus on an interactive environment with real-world observations.…”
Section: Related Work and Datasets (mentioning)
confidence: 99%
“…We perform linguistically-driven analysis to two additional navigation datasets: SAIL [21,7] and LANI [25], both using simulated environments. Both datasets include paragraphs segmented into single instructions.…”
Section: B Additional Data Analysis (mentioning)
confidence: 99%
“…By specifying the agent's task with a high-level end-goal, our setup does not assume the requester knows how to accomplish the task before requesting it. This aspect, along with the agent-advisor interaction, distinguishes our setup from instruction-following setups [2,45,44,6,12,13], in which the requester provides the agent with detailed sequential steps to execute a task only at the beginning. Constraint formulation.…”
Section: Vision-Based Navigation with Language-Based Assistance (mentioning)
confidence: 99%