Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI 2020)
DOI: 10.24963/ijcai.2020/124

Diagnosing the Environment Bias in Vision-and-Language Navigation

Abstract: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations. These step-by-step navigational instructions are crucial when the agent is navigating new environments about which it has no prior knowledge. Most recent works that study VLN observe a significant performance drop when tested on unseen environments (i.e., environments not used in training), indicating that the neural agent models are highly biased towards the training environments.
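The paper's central diagnostic is feature replacement: keep the agent model and its training procedure fixed, but swap the low-level ResNet visual features for higher-level semantic representations. Below is a minimal PyTorch sketch of that contrast, assuming standard torchvision models; the class-histogram pooling over segmentation outputs is an illustrative choice, not the authors' exact feature.

```python
# Minimal sketch (not the authors' implementation): contrasting the two
# kinds of visual features the paper's diagnosis compares.
import torch
import torchvision.models as models
from torchvision.models.segmentation import fcn_resnet50

# Low-level appearance features: a standard ResNet backbone with the
# classifier head removed, as commonly used to encode VLN views.
resnet = models.resnet152(weights=None)
resnet.fc = torch.nn.Identity()
resnet.eval()

# High-level semantic features: an off-the-shelf segmentation network.
segmenter = fcn_resnet50(weights=None, num_classes=21)
segmenter.eval()

@torch.no_grad()
def appearance_features(views: torch.Tensor) -> torch.Tensor:
    """views: (B, 3, H, W) view crops -> (B, 2048) appearance features."""
    return resnet(views)

@torch.no_grad()
def semantic_features(views: torch.Tensor) -> torch.Tensor:
    """Per-view histogram over predicted semantic classes -> (B, 21).

    Pooling away spatial detail discards the low-level textures and
    colors that the paper identifies as the source of environment
    bias, keeping only which semantic classes are visible per view.
    """
    logits = segmenter(views)["out"]           # (B, 21, H, W)
    labels = logits.argmax(dim=1)              # (B, H, W) class map
    one_hot = torch.nn.functional.one_hot(labels, num_classes=21)
    return one_hot.float().mean(dim=(1, 2))    # fraction of pixels per class
```

Per the abstract's finding, features of the second kind generalize better to unseen environments because they carry less environment-specific appearance information, which is why the seen/unseen performance gap shrinks without any change to the agent itself.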

Cited by 38 publications (32 citation statements); references 2 publications.

Citation statements (ordered by relevance):
“…Therefore, we believe that visual differences should be learned by understanding and comparing every single image's semantic representation. A most recent work (Zhang et al., 2020) conceptually supports this argument, where they show that low-level ResNet visual features lead to poor generalization in vision-and-language navigation, and high-level semantic segmentation helps the agent…”
Section: Introduction
Confidence: 93%
“…Visual-and-language navigation (VLN) [87, 88, 118–121] is a multimodal task that has become increasingly popular in recent years. The idea behind VLN is to combine several active domains (i.e., natural language, vision, and action) to enable robots (intelligent agents) to navigate easily in unstructured environments.…”
Section: Vision-and-Language Navigation
Confidence: 99%
“…Many models have been developed, such as the Speaker-Follower model (Fried et al., 2018), the Self-Monitoring Navigation Agent (Ma et al., 2019a; Ke et al., 2019), the Regretful Agent (Ma et al., 2019b), and the environment drop-out model (Tan et al., 2019). The VLN benchmark is further extended to study the fidelity of instruction following (Jain et al., 2019) and examined to understand the bias of the benchmark (Zhang et al., 2020). Beyond navigation, there are also benchmarks that additionally incorporate object manipulation to broaden research on vision and language reasoning, such as embodied question answering (Das et al., 2018a; Gordon et al., 2018).…”
Section: Related Work
Confidence: 99%