2021
DOI: 10.3390/s21031012

Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation

Abstract: Due to the development of computer vision and natural language processing technologies in recent years, there has been growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data, such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data for real-time perception of the task status from panoramic images and natural language instructions. This study proposes a novel deep neural…
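The abstract is truncated and does not detail the proposed network. As a rough orientation only, the sketch below shows what a joint multimodal embedding for VLN commonly looks like: per-view panoramic image features and an encoded instruction are projected into a shared space and fused with cross-modal attention. Every module name, dimension, and layer choice here is an assumption for illustration, not the architecture of the cited paper.

```python
import torch
import torch.nn as nn

class JointMultimodalEmbedding(nn.Module):
    """Minimal sketch of a joint visual-textual embedding for VLN.
    All dimensions and layers are illustrative assumptions, not the
    model proposed in the paper."""

    def __init__(self, img_dim=2048, word_dim=300, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)        # per-view visual features
        self.instr_encoder = nn.LSTM(word_dim, hidden_dim,
                                     batch_first=True)        # instruction tokens
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                                batch_first=True)

    def forward(self, pano_feats, instr_embeds):
        # pano_feats:   (batch, num_views, img_dim), e.g. 36 panorama views
        # instr_embeds: (batch, seq_len, word_dim), pre-embedded instruction
        v = self.img_proj(pano_feats)
        h, _ = self.instr_encoder(instr_embeds)
        # Ground each panorama view in the instruction via cross-modal attention.
        fused, _ = self.cross_attn(query=v, key=h, value=h)
        return fused  # (batch, num_views, hidden_dim) joint embedding
```

An agent would typically score navigable directions against this fused representation at each step; the backtracking search named in the title would then revisit earlier viewpoints when those scores drop, though the truncated abstract does not specify how.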

Cited by 2 publications (8 citation statements)
References 18 publications
“…There are many studies on language-directed navigation [38], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83]. For the language-directed navigation task, many benchmark tasks have been proposed for vision-and-language navigation (VLN) (R2R [8], R4R [34], REVERIE [37], NDH [35], HANNA [36], Robo-VLN [38]).…”
Section: Visual Language Navigation
confidence: 99%
“…Next, we describe a model that focuses on pre-training in VLN. Various models [79], [80], [81], [82] use pre-training and fine-tuning for VLN tasks. Pre-training has been used to address the limited amount of data available for VLN.…”
Section: Visual Language Navigation
confidence: 99%
“…To improve the accuracy of visual-language navigation tasks, many scholars (Fried et al., 2018; Ma et al., 2019; Wang et al., 2019; Majumdar et al., 2020; Zhu F. et al., 2020; Hwang and Kim, 2021; Lianbo et al., 2021a,b) have made many contributions. The Speaker–Follower model was proposed by Fried et al.…”
Section: Related Work
confidence: 99%
“…Although many scholars have proposed different multimodal fusion methods (Fried et al., 2018; Landi et al., 2019; Hwang and Kim, 2021), these methods can still be improved. There are two main reasons: (1) recent fusion methods cannot accurately focus on the relatively important features of the visual and textual information, including landmark objects in the scene and landmark directions in the instructions;…”
Section: Introduction
confidence: 99%
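The critique above concerns fusion methods that fail to emphasize landmark objects and landmark directions. As a purely hypothetical illustration of the instruction-conditioned reweighting the citing authors call for (every name, dimension, and layer below is an assumption, not a published model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkAwareFusion(nn.Module):
    """Toy sketch: weight detected scene objects by how strongly any
    instruction word matches them, so likely landmarks dominate the
    pooled visual representation. Illustrative only."""

    def __init__(self, obj_dim=2048, txt_dim=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, txt_dim)  # map objects into text space

    def forward(self, obj_feats, word_feats):
        # obj_feats:  (batch, num_objs, obj_dim)  detected object features
        # word_feats: (batch, seq_len, txt_dim)   instruction word encodings
        o = self.obj_proj(obj_feats)
        # Affinity of every object with every instruction word.
        affinity = torch.einsum('bnd,btd->bnt', o, word_feats)
        # An object counts as a likely landmark if some word matches it strongly.
        landmark_weight = F.softmax(affinity.max(dim=2).values, dim=1)
        # Pool objects by landmark weight: (batch, txt_dim).
        return (landmark_weight.unsqueeze(-1) * o).sum(dim=1)
```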