2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01315

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

Cited by 203 publications (213 citation statements) | References 16 publications
“…Another drawback of the models is the use of a recurrent neural network to model the sequence of words used in natural language instructions, which is unsuitable for parallel processing. To overcome these limitations, some researchers developed pretrained models [20, 21] in which natural language instructions and images for the VLN task are embedded together with large-scale benchmark datasets in addition to R2R datasets. VisualBERT [22], Vision-and-Language BERT (ViLBERT) [23], Visual-Linguistic BERT (VL-BERT) [24], and UNiversal Image-TExt Representation (UNITER) [25] are pretrained models applicable to various vision–language tasks.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
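
The parallel-processing point in the excerpt above can be made concrete. Below is a minimal PyTorch sketch, not drawn from any of the cited models, of the general idea: instruction tokens and image-region features are embedded together into one sequence for a transformer encoder, so both modalities are related by self-attention in a single parallel pass rather than step by step as in an RNN. The dimensions, layer names, and the img_proj projection are illustrative assumptions.

# Minimal sketch (assumptions noted above), not the paper's implementation.
import torch
import torch.nn as nn

class JointTextImageEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, n_heads=4, n_layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        # Project pre-extracted image features (e.g., 2048-d CNN region
        # features) into the same space as the word embeddings.
        self.img_proj = nn.Linear(2048, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids, img_feats):
        # token_ids: (batch, n_tokens); img_feats: (batch, n_regions, 2048)
        text = self.word_emb(token_ids)
        img = self.img_proj(img_feats)
        # Concatenate both modalities into one sequence; self-attention then
        # relates words and image regions in one parallel pass.
        return self.encoder(torch.cat([text, img], dim=1))

enc = JointTextImageEncoder()
out = enc(torch.randint(0, 30522, (1, 12)), torch.randn(1, 36, 2048))
print(out.shape)  # torch.Size([1, 48, 256])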
“…VisualBERT [22], Vision-and-Language BERT (ViLBERT) [23], Visual-Linguistic BERT (VL-BERT) [24], and UNiversal Image-TExt Representation (UNITER) [25] are pretrained models applicable to various vision–language tasks. There are also models pretrained specifically for VLN tasks [20, 21]. These VLN-specific models have a simple structure that immediately selects one of the candidate actions because they use only the multimodal context of the concurrently embedded data extracted according to natural language instructions and input images.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
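
As a rough illustration of the "simple structure that immediately selects one of the candidate actions" described in the excerpt above, the PyTorch sketch below scores each navigable candidate view against a pooled multimodal context and picks the best one. The dot-product scorer, dimensions, and layer names are assumptions for the sketch, not details taken from the cited models.

# Minimal sketch (assumptions noted above), not the cited models' code.
import torch
import torch.nn as nn

class CandidateActionSelector(nn.Module):
    def __init__(self, hidden=256, cand_dim=2048):
        super().__init__()
        # Project candidate-view features into the context space.
        self.cand_proj = nn.Linear(cand_dim, hidden)

    def forward(self, context, cand_feats):
        # context: (batch, hidden) pooled multimodal embedding
        # cand_feats: (batch, n_candidates, cand_dim) candidate-view features
        cands = self.cand_proj(cand_feats)                           # (B, K, H)
        logits = torch.bmm(cands, context.unsqueeze(2)).squeeze(2)   # (B, K)
        return logits.softmax(dim=-1)  # distribution over candidate actions

selector = CandidateActionSelector()
probs = selector(torch.randn(1, 256), torch.randn(1, 8, 2048))
print(probs.argmax(dim=-1))  # index of the selected candidate action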