2021
DOI: 10.48550/arxiv.2106.07876
Preprint

Vision-Language Navigation with Random Environmental Mixup

Abstract: Vision-language Navigation (VLN) tasks require an agent to navigate step by step while perceiving visual observations and comprehending a natural-language instruction. The large data bias caused by the disparity between the small scale of available data and the large navigation space makes the VLN task challenging. Previous works have proposed various data augmentation methods to reduce data bias; however, these works do not explicitly reduce the data bias across different house scenes. Therefore, the agent wo…
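The abstract's idea of mixing up environments can be made concrete with a sketch. The following is a hypothetical illustration only, not the authors' implementation: episodes are treated as (path, instruction) pairs over scene graphs, and an augmented cross-scene episode is built by splicing a path prefix from one scene onto a path suffix from another, with the matching instruction halves concatenated. All names (`Episode`, `split_episode`, `mix_episodes`) are invented for this sketch.

```python
import random
from dataclasses import dataclass

# Hypothetical data structures for illustration; the actual REM pipeline
# operates on Matterport3D navigation graphs and R2R annotations.

@dataclass
class Episode:
    scene_id: str
    path: list          # sequence of viewpoint ids
    instruction: str    # natural-language instruction

def split_episode(ep: Episode, cut: int):
    """Split a path and (crudely) its instruction at a cut point."""
    words = ep.instruction.split()
    w_cut = int(len(words) * cut / len(ep.path))
    return ((ep.path[:cut], " ".join(words[:w_cut])),
            (ep.path[cut:], " ".join(words[w_cut:])))

def mix_episodes(ep_a: Episode, ep_b: Episode, rng=random) -> Episode:
    """Cross-connect two scenes: path prefix from A, suffix from B.

    A real implementation would also stitch the two navigation graphs
    at the junction viewpoints so the mixed path is traversable; this
    sketch only assembles the augmented path/instruction pair.
    """
    cut_a = rng.randint(1, len(ep_a.path) - 1)
    cut_b = rng.randint(1, len(ep_b.path) - 1)
    (prefix, instr_a), _ = split_episode(ep_a, cut_a)
    _, (suffix, instr_b) = split_episode(ep_b, cut_b)
    return Episode(
        scene_id=f"{ep_a.scene_id}+{ep_b.scene_id}",
        path=prefix + suffix,
        instruction=f"{instr_a} {instr_b}".strip(),
    )
```

Splitting the instruction proportionally to the path cut is a deliberately crude heuristic here; aligning instruction sub-spans to path segments is itself a nontrivial step in any real augmentation pipeline.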

Cited by 2 publications (8 citation statements)
References 41 publications
“…[36] proposes an environmental dropout method based on view consistency to mimic novel and diverse environments. From a different perspective, REM [24] reconnects the seen scenes to generate augmented data by mixing up environments. To further understand the relations between instructions and scenes, [16] and [33] take the objects in scenes and the corresponding words in instructions as the minimal units of encoding.…”
Section: Related Work
confidence: 99%
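To make the quoted description of [36] concrete: environmental dropout removes the same feature channels from every view of an environment, so the masked features behave like a consistently altered "new" environment rather than independent noise. Below is a minimal sketch under the assumption that view features are precomputed tensors of shape (num_views, feat_dim); the function name and default rate are illustrative, not taken from [36].

```python
import torch

def environmental_dropout(env_feats: torch.Tensor, p: float = 0.4) -> torch.Tensor:
    """Drop the same feature channels across all views of one environment.

    env_feats: (num_views, feat_dim) precomputed visual features.
    Unlike standard dropout, the mask is shared over the view dimension,
    so a dropped channel vanishes consistently everywhere, mimicking a
    novel environment with some visual content removed.
    """
    keep = (torch.rand(env_feats.size(-1), device=env_feats.device) > p).float()
    return env_feats * keep / (1.0 - p)   # inverted-dropout rescaling
```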
“…However, its end-to-end training requires 20 NVIDIA V100 GPUs for ∼20 hours, which is a much higher cost than ours (3 V100s for 1 day). Another recent work, Mixup [24], extends the 61 scenes in the training set to 116 cross-connected scenes with data augmentation and thus uses more data than other methods. We leave more comparisons to the Appendix.…”
Section: Comparisons with SOTA
confidence: 99%
“…We apply our modification and snapshot ensemble to the VLN BERT model proposed by Hong et al. (2021). The model currently holds the best performance for the single-run setting on the R2R dataset (Liu et al. 2021). In this section, we give a brief recap of this model.…”
Section: VLN BERT Model
confidence: 99%
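Snapshot ensembling, as used in this quotation, trains one model under a cyclic learning rate, checkpoints it at the end of each cycle, and averages the snapshots' predictions at test time (Huang et al. 2017). A generic PyTorch sketch follows; `train_one_epoch` and the cycle lengths are placeholders, and this is not the cited paper's actual code.

```python
import copy
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

def train_snapshot_ensemble(model, optimizer, train_one_epoch,
                            epochs_per_cycle=10, num_cycles=5):
    """Collect one snapshot per cosine-annealing cycle (Huang et al. 2017)."""
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=epochs_per_cycle)
    snapshots = []
    for cycle in range(num_cycles):
        for epoch in range(epochs_per_cycle):
            train_one_epoch(model, optimizer)   # user-supplied training step
            scheduler.step()
        # The learning rate has annealed to (near) zero at the end of a
        # cycle, so the model sits in a local optimum: save a snapshot.
        snapshots.append(copy.deepcopy(model.state_dict()))
    return snapshots

@torch.no_grad()
def ensemble_predict(model, snapshots, inputs):
    """Average the logits of all saved snapshots at inference time."""
    logits = []
    for state in snapshots:
        model.load_state_dict(state)
        model.eval()
        logits.append(model(inputs))
    return torch.stack(logits).mean(dim=0)
```

The appeal of the technique is that the M ensemble members come from a single training run, which matters for a model as expensive to train as VLN BERT.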
“…Previous studies have also made efforts to prevent overfitting due to the limited size of the R2R dataset (Fried et al. 2018; Liu et al. 2021; Li et al. 2019; Hao et al. 2020).…”
Section: Introduction
confidence: 99%