2021
DOI: 10.3390/s21031012

Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation

Abstract: Due to the development of computer vision and natural language processing technologies in recent years, there has been growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data, such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data for real-time perception of the task status from panoramic images and natural language instructions. This study proposes a novel deep neural…
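The abstract is truncated and does not detail the proposed network. As a rough orientation only, the sketch below shows what a joint multimodal embedding for VLN commonly looks like: per-view panoramic image features and an encoded instruction are projected into a shared space and fused with cross-modal attention. Every module name, dimension, and layer choice here is an assumption for illustration, not the architecture of the cited paper.

```python
import torch
import torch.nn as nn

class JointMultimodalEmbedding(nn.Module):
    """Minimal sketch of a joint visual-textual embedding for VLN.
    All dimensions and layers are illustrative assumptions, not the
    model proposed in the paper."""

    def __init__(self, img_dim=2048, word_dim=300, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)        # per-view visual features
        self.instr_encoder = nn.LSTM(word_dim, hidden_dim,
                                     batch_first=True)        # instruction tokens
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                                batch_first=True)

    def forward(self, pano_feats, instr_embeds):
        # pano_feats:   (batch, num_views, img_dim), e.g. 36 panorama views
        # instr_embeds: (batch, seq_len, word_dim), pre-embedded instruction
        v = self.img_proj(pano_feats)
        h, _ = self.instr_encoder(instr_embeds)
        # Ground each panorama view in the instruction via cross-modal attention.
        fused, _ = self.cross_attn(query=v, key=h, value=h)
        return fused  # (batch, num_views, hidden_dim) joint embedding
```

An agent would typically score navigable directions against this fused representation at each step; the backtracking search named in the title would then revisit earlier viewpoints when those scores drop, though the truncated abstract does not specify how.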

Cited by 2 publications (8 citation statements)
References 18 publications
“…There are many studies on language-directed navigation [38], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83]. For the language-directed navigation task, many benchmark tasks have been proposed for vision-and-language navigation (VLN) (R2R [8], R4R [34], REVERIE [37], NDH [35], HANNA [36], Robo-VLN [38]).…”
Section: Visual Language Navigation
confidence: 99%
“…Next, we describe a model that focuses on pre-training in VLN. Various models [79], [80], [81], [82] use pre-training and fine-tuning for VLN tasks. Pre-training has been used to address the limited amount of data available for VLN.…”
Section: Visual Language Navigation
confidence: 99%
“…To improve the accuracy of visual-language navigation tasks, many scholars (Fried et al., 2018; Ma et al., 2019; Wang et al., 2019; Majumdar et al., 2020; Zhu F. et al., 2020; Hwang and Kim, 2021; Lianbo et al., 2021a,b) have made many contributions. The Speaker–Follower model was proposed by Fried et al.…”
Section: Related Work
confidence: 99%
“…Although many scholars have proposed different multimodal fusion methods (Fried et al., 2018; Landi et al., 2019; Hwang and Kim, 2021), these methods can still be improved. There are two main reasons: (1) recent fusion methods cannot accurately focus on the relatively important features of the visual and textual information, including landmark objects in the scene and landmark directions in the instructions;…”
Section: Introduction
confidence: 99%
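The critique above concerns fusion methods that fail to emphasize landmark objects and landmark directions. As a purely hypothetical illustration of the instruction-conditioned reweighting the citing authors call for (every name, dimension, and layer below is an assumption, not a published model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkAwareFusion(nn.Module):
    """Toy sketch: weight detected scene objects by how strongly any
    instruction word matches them, so likely landmarks dominate the
    pooled visual representation. Illustrative only."""

    def __init__(self, obj_dim=2048, txt_dim=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, txt_dim)  # map objects into text space

    def forward(self, obj_feats, word_feats):
        # obj_feats:  (batch, num_objs, obj_dim)  detected object features
        # word_feats: (batch, seq_len, txt_dim)   instruction word encodings
        o = self.obj_proj(obj_feats)
        # Affinity of every object with every instruction word.
        affinity = torch.einsum('bnd,btd->bnt', o, word_feats)
        # An object counts as a likely landmark if some word matches it strongly.
        landmark_weight = F.softmax(affinity.max(dim=2).values, dim=1)
        # Pool objects by landmark weight: (batch, txt_dim).
        return (landmark_weight.unsqueeze(-1) * o).sum(dim=1)
```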