2020
DOI: 10.48550/arxiv.2012.08673
Preprint

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

Linjie Li,
Zhe Gan,
Jingjing Liu

Abstract: Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although achieving impressive performance on standard tasks, to date, it still remains unclear how robust these pretrained models are. To investigate, we conduct a host of thorough evaluations on existing pre-trained models over 4 different types of V+L specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Cont…

Cited by 13 publications (15 citation statements)
References: 59 publications
“…Robust VQA Benchmarks. Following [32], we also evaluate our model on a suite of robust VQA benchmarks. VQA-Rephrasings [48] exposes VQA models to linguistic variations in questions, and measures consistency of model predictions to different semantically equivalent questions.…”
Section: Methods (mentioning)
confidence: 99%
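As a rough illustration of the consistency measurement described in the statement above, the sketch below counts a question group as consistent only when the model predicts the same answer for every rephrasing. This is a simplified exact-match notion, not necessarily the consensus metric defined in VQA-Rephrasings [48]; the function name and data layout are hypothetical.

```python
def consistency_score(predictions_per_question):
    """Fraction of question groups where the model gives the same answer
    to the original question and all of its rephrasings.

    `predictions_per_question` maps a question-group id to the list of
    answers predicted for that question and each rephrasing of it.
    """
    consistent = 0
    for answers in predictions_per_question.values():
        # A group counts as consistent only if all rephrasings agree.
        if len(set(answers)) == 1:
            consistent += 1
    return consistent / max(len(predictions_per_question), 1)

# Illustrative usage with hypothetical model outputs.
preds = {
    "q1": ["red", "red", "red"],   # consistent across 3 rephrasings
    "q2": ["two", "2", "two"],     # inconsistent under exact match
}
print(consistency_score(preds))    # 0.5
```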
“…We adopt the same strategy as VQA and use binary cross entropy to optimize the model training. For datasets built on the original VQAv2 val splits, we follow [32] to finetune all models on VQAv2 training split to avoid data contamination. For adversarial VQA datasets, we follow [33,50] to evaluate models finetuned on VQAv2+VG-QA.…”
Section: Appendix (mentioning)
confidence: 99%
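The binary cross-entropy training mentioned above is commonly applied to VQA by treating answer prediction as multi-label classification over a fixed answer vocabulary with soft targets. The PyTorch sketch below assumes that setup; the tensor shapes and the 3129-answer vocabulary size are illustrative assumptions, not details taken from the cited papers.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a batch of 8 questions, answer vocabulary of 3129
# candidate answers (a size often used for VQAv2; purely illustrative).
batch_size, num_answers = 8, 3129

# `logits` would come from the V+L model's answer head; `targets` are soft
# scores in [0, 1] reflecting how many annotators gave each answer.
logits = torch.randn(batch_size, num_answers, requires_grad=True)
targets = torch.rand(batch_size, num_answers)

# Binary cross entropy with logits, applied independently per answer class.
criterion = nn.BCEWithLogitsLoss(reduction="mean")
loss = criterion(logits, targets)
loss.backward()  # in real training this backpropagates through the model
```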
“…Along the journey of VLP, researchers have investigated different training strategies [12,35], robustness [29], compression [10,11,47], probing analysis [3,31], and the extension to video-text modeling [24,28,43,44,59]. More recently, instead of using object detectors for image feature extraction, end-to-end VLP based on convolutional networks and transformers is becoming popular [17,18,22,27,49].…”
Section: Related Work (mentioning)
confidence: 99%
“…Visual-linguistic Pre-training. Following the prominent progress in transformer-based [63] pre-training in natural language [13,47,32,4,10,48], visual-linguistic pre-training models, either for image+text [39,59,8,37,20,73,36,15,35,40] or for video+text [56,35,41,75,33], have achieved great success on a number of downstream V+L tasks. Most existing VL models are designed in a two-step fashion: a pre-trained object detector is used to encode the image as a set of regional features (as offline visual tokens), followed by pre-training on a large-scale visual-linguistic corpus using tasks like masked language modeling, image-text matching or masked region modeling losses.…”
Section: Related Work (mentioning)
confidence: 99%
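The two-step design summarized above (offline detector region features, then transformer pre-training with objectives such as image-text matching) can be sketched roughly as follows. All module names, dimensions, and the single ITM head are illustrative assumptions, not the architecture or API of any particular model cited here.

```python
import torch
import torch.nn as nn

class TwoStepVLModel(nn.Module):
    """Minimal sketch: offline region features + text embeddings fed to a
    transformer encoder, with a binary image-text matching (ITM) head."""

    def __init__(self, hidden=768, vocab_size=30522, num_layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(2048, hidden)  # project detector features
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.itm_head = nn.Linear(hidden, 2)  # matched vs. mismatched pair

    def forward(self, token_ids, region_feats):
        text = self.text_embed(token_ids)          # (B, T, H)
        regions = self.region_proj(region_feats)   # (B, R, H)
        fused = self.encoder(torch.cat([text, regions], dim=1))
        return self.itm_head(fused[:, 0])          # first token acts as [CLS]

# Illustrative forward pass: 16 text tokens and 36 pre-extracted regions.
model = TwoStepVLModel()
logits = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
print(logits.shape)  # torch.Size([2, 2])
```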