HOP: History-and-Order Aware Pretraining for Vision-and-Language Navigation

Qiao, Yanyuan; Qi, Yuankai; Hong, Yicong; Zheng, Yu; Wang, Peng; Wu, Qi

doi:10.1109/cvpr52688.2022.01498

Cited by 41 publications

(18 citation statements)

References 24 publications


“…They explore diverse training strategies [84,83], mine extra supervisory signals from synthesized samples [27,71,28] or auxiliary tasks [83,35,53,93,78], and explore intelligent path planning [39,54,81]. For structured and long-range context modeling, recent solutions were developed with environment map [92,13,21,80], transformer architectures [33,61,48,64,11], and multimodal pretraining [56,31,30,12].…”

Section: Related Work (mentioning)

confidence: 99%

“…Later, [30,31,12] conduct pretraining on abundant web image-captions [30] or synthesized trajectory-instruction pairs [31,12] with different VLN-specific proxy tasks. [11,64] introduce history-aware proxy tasks for more VLN-aligned pretraining.…”

Section: Related Work (mentioning)

confidence: 99%

“…Teacher-forcing [89] is used here to enable the parallel text input. Worth mentioning is that, existing VLN pretraining methods [30,31,30,11,64] rely on the masked language modeling (MLM) strategy. Since MLM only predicts a small portion (typically 15%) of input words during each training iteration, it is less efficient for large-scale pretraining data, as pointed out by many recent literature in general vision-language pretraining [34,9,15].…”

Section: Network Training (mentioning)

confidence: 99%

“…Training. Following recent VLN practice [56,31,30,11,64], the pretraining and fine-tuning paradigm is adopted: • Pretraining: With the two training objectives (cf. Eq.…”

Section: Implementation Details (mentioning)

confidence: 99%

“…During finetuning, the sampling ratio for IG and IF is set to IG:IF=2:5; the ITM task is abandoned. Following the common practice [11,33,64], we concatenate the object features with the panoramic features and add an object grounding loss for the instruction following task on REVERIE [63]. The detailed architecture of LANA is shown in Table 7.…”

Section: Implementation Details of LANA (mentioning)

confidence: 99%


Lana: A Language-Capable Navigator for Instruction Following and Generation

Wang, Yang, Shao et al. 2023. Preprint


Recently, visual-language navigation (VLN) - entailing robot agents to follow navigation instructions - has shown great advance. However, existing literature put most emphasis on interpreting instructions into actions, only delivering "dumb" wayfinding agents. In this article, we devise LANA, a language-capable navigation agent which is able to not only execute human-written navigation commands, but also provide route descriptions to humans. This is achieved by simultaneously learning instruction following and generation with only one single model. More specifically, two encoders, respectively for route and language encoding, are built and shared by two decoders, respectively for action prediction and instruction generation, so as to exploit cross-task knowledge and capture task-specific characteristics. Throughout pretraining and fine-tuning, both instruction following and generation are set as optimization objectives. We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performances on both instruction following and route description, with nearly half complexity. In addition, endowed with language generation capability, LANA can explain to human its behaviours and assist human's wayfinding. This work is expected to foster future efforts towards building more trustworthy and socially-intelligent navigation robots.


Learning Disentanglement with Decoupled Labels for Vision-Language Navigation

Cheng, Dong, Khan. 2022. Lecture Notes in Computer Science


ESceme: Vision-and-Language Navigation with Episodic Scene Memory

Zheng, Liu, Wang et al. 2024. Int J Comput Vis


Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches have made enormous progress in navigation in new environments, such as beam search, pre-exploration, and dynamic or hierarchical history encoding. To balance generalization and efficiency, we resort to memorizing visited scenarios apart from the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent’s memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture of the next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. Our ESceme also wins first place on the CVDN leaderboard. Code is available: https://github.com/qizhust/esceme.
