2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00835

Structured Scene Memory for Vision-Language Navigation

Abstract: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It is becoming increasingly important in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting: vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, whic…

Cited by 86 publications (54 citation statements)
References 89 publications
“…Exploration and language grounding are two essential abilities for VLN agents. However, existing works either only allow for local actions A_t [13][14][15], which hinders long-range action planning, or lack object representations O_t [8,19,20], which might be insufficient for fine-grained grounding. Our work addresses both issues with a dual-scale representation and global action planning.…”
Section: Methods
confidence: 99%
“…Therefore, several works [38,39] propose to represent the map as topological structures for pre-exploring environments [40], or for back-tracking to other locations, trading off navigation accuracy against path length [10,24]. A few recent VLN works [8,19,20] used topological maps to support global action planning, but they suffer from using recurrent architectures for state tracking and also lack a fine-scale representation for language grounding, as shown in Figure 2. We address the above limitations via a dual-scale graph transformer with topological maps.…”
Section: Related Work
confidence: 99%
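The topological-map idea quoted above — maintaining a graph of visited and observed-but-unvisited viewpoints so the agent can plan a path to any node in the graph, rather than only stepping to an adjacent one — can be illustrated with a minimal sketch. This is not the cited papers' actual model; the class, node names, and frontier heuristic are all hypothetical, and real VLN agents would score frontier nodes with learned cross-modal features rather than enumerate them.

```python
from collections import deque

class TopoMap:
    """Hypothetical sketch of a topological map for global action planning."""

    def __init__(self):
        self.edges = {}       # node -> set of neighboring nodes
        self.visited = set()  # nodes the agent has physically reached

    def add_edge(self, a, b):
        # Viewpoints are connected when the agent observes one from the other.
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def frontier(self):
        # Observed-but-unvisited nodes: the candidate long-range goals that
        # a purely local action space (move to an adjacent node) cannot reach
        # in one decision.
        return {n for n in self.edges if n not in self.visited}

    def plan_to(self, start, goal):
        # BFS over the whole graph: a *global* action targets any node,
        # and the planner expands it into a sequence of local moves.
        queue, parent = deque([start]), {start: None}
        while queue:
            n = queue.popleft()
            if n == goal:
                path = []
                while n is not None:
                    path.append(n)
                    n = parent[n]
                return path[::-1]
            for m in self.edges.get(n, ()):
                if m not in parent:
                    parent[m] = n
                    queue.append(m)
        return None  # goal not connected to start
```

For example, after visiting nodes "A" and "B" and observing "C" and "D" from "B", the frontier is {"C", "D"}, and a global decision to go to "D" unrolls into the local path A → B → D via `plan_to("A", "D")`.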
“…Another line of research tried to improve unseen generalization through pre-training [13,10,23], auxiliary supervision [21,32,33,29], or training-data processing [26,9,24]. The basic Seq2Seq structure has also been improved by introducing cross-modal attention [30] and fine-grained relationships [12], utilizing the semantic or syntactic information of languages [25,19], reformulating under a Bayesian framework [2], and combining long-range memory for global decisions [6,27]. Besides R2R, some later works also proposed more challenging datasets like Room-for-Room (R4R) [15], TOUCHDOWN [4], and Room-across-Room (RxR) [17].…”
Section: Related Work
confidence: 99%