2023
DOI: 10.1109/tpami.2023.3234243
HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation

Cited by 25 publications (6 citation statements)
References 35 publications
“…Recently, transformer-based models have shown superior performance thanks to their powerful ability to learn generic multi-modal representations [43], [44], [45]. This scheme is further extended by recurrent agent state [4], [46], [47], episodic memory [5], [48], [49], graph memory [15], [50], [51] and prompt learning [52], [53] that significantly improves sequential action prediction.…”
Section: Vision-Language Navigation
confidence: 99%
“…The research on VLN is dedicated to addressing the alignment of linguistic instructions with visual cues and actions: some works fine-grain the navigation instructions to achieve sub-goal planning (Hong et al. 2020; He et al. 2021a; Zhu et al. 2020), and some concentrate on utilizing object information to identify landmarks from observations (Gao et al. 2021; Qi et al. 2020a, 2021). Temporal information is specifically designed in (Hao et al. 2020; Hong et al. 2021; Chen et al. 2021, 2022; Qiao et al. 2022, 2023; Zhao, Qi, and Wu 2023) to capture long-range dependencies across past observations and actions, which are crucial during navigation. Some methods incorporate external knowledge during navigation (Li et al. 2022; Gao et al. 2021).…”
Section: Related Work
confidence: 99%
“…utilizing common-sense knowledge to address the remote embodied visual referring expression in real indoor environments (REVERIE) task [37]. Furthermore, Qiao et al. [38] proposed a history-enhanced and order-aware pre-training with the complementing fine-tuning paradigm (HOP+) for vision-and-language navigation. The regions of interest (ROIs) are extracted from the image modality as image features using the Fast R-CNN algorithm [39].…”
Section: A. Vision and Language Pretrained Models
confidence: 99%