2018
DOI: 10.48550/arxiv.1809.00786
Preprint

Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction

Cited by 13 publications (21 citation statements)
References 0 publications
“…[52] tackle 'interactive navigation', where the robot can bump into and push objects during navigation, but does not have an arm. Some works [56][57][58] abstract away gross motor control entirely by using symbolic interaction capabilities (e.g. a 'pick up X' action) or a 'magic pointer' [9].…”
Section: Related Work
confidence: 99%
“…Natural language, the most common modality for human-human and human-robot communication, can realize grounding in various ways. For communication with robots, language can be interpreted to map instructional commands to actions [29,30]. For static images or text, grounding can take the form of visually grounded [31,32] or text-based [33] Q&A.…”
Section: Language Grounding
confidence: 99%
“…End-to-end learning approaches: Several recent deep learning approaches propose to learn a mapping directly from inputs to actions, whether structured observations are provided [22,33] or the agent deals with raw visual observations [25,43]. Cross-modal grounding of language instructions to visual observations is used in several works, e.g., via reinforcement learning [38,37] or autoencoder architectures that impose a language-instruction-based heat map on the visual observations (using U-Net architectures [24], attention mechanisms [49], or nonlinear differentiable filters [2]). However, as we show later in the results, going end to end may not be best for generalizing to compositional tasks in unseen environments.…”
Section: Related Work
confidence: 99%
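
The LingUNet mechanism the excerpt above attributes to [24] can be made concrete: slices of the instruction embedding are projected into 1x1 convolution kernels that filter the visual feature maps at each scale of a small U-Net, and the decoder turns the filtered maps into a goal-location heat map. The sketch below is a minimal, illustrative PyTorch version of that idea; the class name LingUNetSketch, the depth, and all layer sizes are hypothetical choices for readability, not the published configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LingUNetSketch(nn.Module):
    # Minimal LingUNet-style module: the instruction embedding is split
    # into slices, each slice is projected into per-example 1x1 conv
    # kernels, and those kernels filter the U-Net skip connections
    # before decoding into a single-channel goal heat map.
    def __init__(self, text_dim=256, channels=32, depth=2):
        super().__init__()
        self.depth, self.channels = depth, channels
        # Encoder: strided convs over precomputed visual features.
        self.enc = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(depth))
        # One text-to-kernel projection per scale.
        self.text_to_kernel = nn.ModuleList(
            nn.Linear(text_dim // depth, channels * channels)
            for _ in range(depth))
        # Decoder: transposed convs that upsample back to input size.
        self.dec = nn.ModuleList(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
            for _ in range(depth))
        self.head = nn.Conv2d(channels, 1, 1)  # heat-map logits

    def forward(self, feats, text):
        # feats: (B, C, H, W) visual features; text: (B, text_dim).
        skips, x = [], feats
        for conv in self.enc:
            x = F.relu(conv(x))
            skips.append(x)
        slices = torch.chunk(text, self.depth, dim=1)
        x = 0
        for i in reversed(range(self.depth)):
            b, c, h, w = skips[i].shape
            # Project the i-th instruction slice into B sets of (C x C)
            # 1x1 kernels and apply them with a grouped convolution, so
            # each example is filtered by its own instruction.
            kernels = self.text_to_kernel[i](slices[i]).view(b * c, c, 1, 1)
            filtered = F.conv2d(skips[i].reshape(1, b * c, h, w),
                                kernels, groups=b).view(b, c, h, w)
            x = F.relu(self.dec[i](filtered + x))
        return self.head(x)

# Illustrative usage: 32x32 feature maps, 256-dim instruction embedding.
model = LingUNetSketch()
heatmap_logits = model(torch.randn(4, 32, 32, 32), torch.randn(4, 256))
print(heatmap_logits.shape)  # torch.Size([4, 1, 32, 32])

Generating the kernels from the text, rather than concatenating text and image features, lets each instruction act as its own image filter, which is the motivation usually cited for this family of language-conditioned U-Nets.
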
“…Using such keywords from language to identify regions in an image is known as referring-expression image segmentation (e.g., [44]). We propose a simple restructuring of the language inputs to LingUNet [24] and retrain it conditioned on semantic word labels provided by the language understanding module.…”
Section: Overview of MoViLan Framework
confidence: 99%