2017
DOI: 10.48550/arxiv.1711.07280
Preprint

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

Cited by 8 publications (9 citation statements)
References 0 publications
Years published: 2017–2024

“…To show that the scheduled mechanism is able to provide general improvements, we compare our scheduled RL with vanilla RL and a mix-loss [Ranzato et al., 2015] method on this dataset. We use a network architecture similar to that of [Anderson et al., 2017]. Instead of training the agent using only a cross-entropy loss to imitate demonstration actions, we introduce a distance-based reward.…”
Section: Discussion
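
The excerpt above combines a supervised imitation term with a distance-based reward. As a rough illustration only (not the cited authors' exact objective), the sketch below mixes a cross-entropy loss on demonstration actions with a REINFORCE-style term whose reward is the per-step reduction in distance to the goal; the discrete action space and the fixed mixing weight `alpha` are placeholder assumptions, where a schedule would anneal `alpha` during training.

```python
# Rough sketch (assumed names, not the cited implementation): mix a
# cross-entropy imitation loss on demonstration actions with a
# REINFORCE-style term driven by a distance-based reward.
import torch
import torch.nn.functional as F

def mixed_loss(logits, demo_actions, sampled_actions, dist_before, dist_after, alpha):
    """logits: (T, A) action scores over a trajectory of T steps.
    demo_actions / sampled_actions: (T,) action indices (teacher / sampled).
    dist_before / dist_after: (T,) distance to goal before and after each sampled step.
    alpha in [0, 1]: imitation weight (a schedule could anneal this during training)."""
    # Supervised term: imitate the demonstrated actions.
    ce = F.cross_entropy(logits, demo_actions)
    # Distance-based reward: positive when a step moves the agent closer to the goal.
    reward = (dist_before - dist_after).detach()
    log_probs = F.log_softmax(logits, dim=-1)
    sampled_logp = log_probs.gather(1, sampled_actions.unsqueeze(1)).squeeze(1)
    rl = -(reward * sampled_logp).mean()  # REINFORCE estimator
    return alpha * ce + (1.0 - alpha) * rl

# Toy usage with random tensors (T steps, A actions).
T, A = 5, 6
loss = mixed_loss(torch.randn(T, A), torch.randint(A, (T,)), torch.randint(A, (T,)),
                  torch.rand(T) + 1.0, torch.rand(T), alpha=0.7)
```
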
“…Figure 6: Distance error evaluated on the unseen development scenes of the Room-to-Room environment. More recently, a new dataset [Anderson et al., 2017] with realistic indoor scenes has been released. This dataset (Room-to-Room) includes 21,567 crowd-sourced natural language instructions and 10,800 panoramic RGB-D images.…”
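
The distance error referenced in this excerpt measures how far from the goal the agent stops. A minimal sketch of such a computation over a navigation graph follows, where the error is the shortest-path distance between the final and goal viewpoints; the graph construction and node names are illustrative assumptions, not the dataset's actual format.

```python
# Illustrative distance-error computation over a navigation graph: the error
# is the shortest-path distance (via edge weights) from the agent's final
# viewpoint to the goal. Graph and node names here are toy assumptions.
import networkx as nx

def navigation_error(graph, final_viewpoint, goal_viewpoint):
    return nx.shortest_path_length(graph, final_viewpoint, goal_viewpoint, weight="weight")

G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 2.0), ("b", "c", 3.5), ("a", "c", 6.0)])
print(navigation_error(G, "c", "a"))  # 5.5 m: c -> b -> a beats the 6.0 m direct edge
```
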
“…The AI community has built numerous platforms to drive algorithmic advances: the Arcade Learning Environment [13], OpenAI Universe [27], Minecraft-based Malmo [28], maze-based DeepMind Lab [29], Doom-based ViZDoom [30], AI2-THOR [31], Matterport3D Simulator [32] and House3D [33]. Several of these environments were created to be powerful 3D sandboxes for developing learning algorithms [28,29,30], while HoME additionally aims to provide a unified platform for multimodal learning in a realistic context (Fig.…”
Section: Related Work
“…In addition to being easier to collect, the first-person perspective captures rich information about the object appearance, as well as the relationships and interactions between the ego-vehicle and objects in the environment. Due to these advantages, egocentric videos have been directly used in applications such as action recognition [5], [6], navigation [7]-[9], and end-to-end autonomous driving [10]. For trajectory prediction, some work has simulated bird's eye views by projecting egocentric video frames onto the ground plane [2], [3], but these projections can be incorrect due to road irregularities or other sources of distortion, which prevent accurate vehicle position prediction.…”
Section: Introduction
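
The ground-plane projection this excerpt mentions is commonly realized as an inverse perspective mapping via a homography. The sketch below shows that general technique, not necessarily the exact method of the cited works [2], [3]; the four point correspondences are placeholder values that would normally come from camera calibration. Because the homography assumes a perfectly flat ground plane, road irregularities violate its premise, which is exactly the source of distortion the excerpt notes.

```python
# Generic inverse perspective mapping (an assumption, not necessarily the
# cited papers' method): warp an egocentric frame to a bird's-eye view with a
# homography fit to four ground-plane correspondences.
import cv2
import numpy as np

def birds_eye_view(frame, src_pts, dst_pts, out_size):
    """src_pts: four image points lying on the road surface (pixels).
    dst_pts: their desired positions in the top-down view (pixels)."""
    H = cv2.getPerspectiveTransform(np.float32(src_pts), np.float32(dst_pts))
    return cv2.warpPerspective(frame, H, out_size)

frame = np.zeros((480, 640, 3), dtype=np.uint8)         # placeholder image
src = [(180, 300), (460, 300), (620, 470), (20, 470)]   # trapezoid on the road
dst = [(100, 0), (300, 0), (300, 400), (100, 400)]      # rectangle in the top-down view
top_down = birds_eye_view(frame, src, dst, (400, 400))
```
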