2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01315

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

Cited by 203 publications (213 citation statements) | References 16 publications
“…Another drawback of the models is the use of a recurrent neural network to model the sequence of words used in natural language instructions, which is unsuitable for parallel processing. To overcome these limitations, some researchers developed pretrained models [20, 21] in which natural language instructions and images for the VLN task are embedded together with large-scale benchmark datasets in addition to R2R datasets. VisualBERT [22], Vision-and-Language BERT (ViLBERT) [23], Visual-Linguistic BERT (VL-BERT) [24], and UNiversal Image-TExt Representation (UNITER) [25] are pretrained models applicable to various vision–language tasks.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
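
The parallel-processing point in the excerpt above can be made concrete. Below is a minimal PyTorch sketch, not drawn from any of the cited models, of the general idea: instruction tokens and image-region features are embedded together into one sequence for a transformer encoder, so both modalities are related by self-attention in a single parallel pass rather than step by step as in an RNN. The dimensions, layer names, and the img_proj projection are illustrative assumptions.

# Minimal sketch (assumptions noted above), not the paper's implementation.
import torch
import torch.nn as nn

class JointTextImageEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, n_heads=4, n_layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        # Project pre-extracted image features (e.g., 2048-d CNN region
        # features) into the same space as the word embeddings.
        self.img_proj = nn.Linear(2048, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids, img_feats):
        # token_ids: (batch, n_tokens); img_feats: (batch, n_regions, 2048)
        text = self.word_emb(token_ids)
        img = self.img_proj(img_feats)
        # Concatenate both modalities into one sequence; self-attention then
        # relates words and image regions in one parallel pass.
        return self.encoder(torch.cat([text, img], dim=1))

enc = JointTextImageEncoder()
out = enc(torch.randint(0, 30522, (1, 12)), torch.randn(1, 36, 2048))
print(out.shape)  # torch.Size([1, 48, 256])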
“…VisualBERT [22], Vision-and-Language BERT (ViLBERT) [23], Visual-Linguistic BERT (VL-BERT) [24], and UNiversal Image-TExt Representation (UNITER) [25] are pretrained models applicable to various vision–language tasks. There are also models pretrained specifically for VLN tasks [20, 21]. These VLN-specific models have a simple structure that immediately selects one of the candidate actions because they use only the multimodal context of the concurrently embedded data extracted according to natural language instructions and input images.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
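
As a rough illustration of the "simple structure that immediately selects one of the candidate actions" described in the excerpt above, the PyTorch sketch below scores each navigable candidate view against a pooled multimodal context and picks the best one. The dot-product scorer, dimensions, and layer names are assumptions for the sketch, not details taken from the cited models.

# Minimal sketch (assumptions noted above), not the cited models' code.
import torch
import torch.nn as nn

class CandidateActionSelector(nn.Module):
    def __init__(self, hidden=256, cand_dim=2048):
        super().__init__()
        # Project candidate-view features into the context space.
        self.cand_proj = nn.Linear(cand_dim, hidden)

    def forward(self, context, cand_feats):
        # context: (batch, hidden) pooled multimodal embedding
        # cand_feats: (batch, n_candidates, cand_dim) candidate-view features
        cands = self.cand_proj(cand_feats)                           # (B, K, H)
        logits = torch.bmm(cands, context.unsqueeze(2)).squeeze(2)   # (B, K)
        return logits.softmax(dim=-1)  # distribution over candidate actions

selector = CandidateActionSelector()
probs = selector(torch.randn(1, 256), torch.randn(1, 8, 2048))
print(probs.argmax(dim=-1))  # index of the selected candidate action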