2020
DOI: 10.48550/arxiv.2006.06195
Preprint

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Zhe Gan,
Yen-Chun Chen,
Linjie Li
et al.

Abstract: We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy…
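The key idea in the abstract, perturbing the image and text embeddings rather than raw pixels or tokens, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example of PGD-style adversarial training on multimodal embeddings, written in the spirit of the "free" strategy; the toy model, hyperparameters, and function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumption: not the authors' code) of adversarial training
# in the embedding space. In the "free"-style setup, the same backward pass
# supplies gradients for both the perturbations and the model parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLEncoder(nn.Module):
    """Toy stand-in for a V+L transformer that consumes image/text embeddings."""
    def __init__(self, dim=64, num_labels=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_labels)

    def forward(self, img_emb, txt_emb):
        x = torch.cat([img_emb, txt_emb], dim=1)       # concatenate modalities
        return self.head(self.encoder(x).mean(dim=1))  # pooled classification

def adversarial_step(model, img_emb, txt_emb, labels,
                     adv_steps=3, adv_lr=1e-2, eps=0.1):
    """Accumulate parameter gradients while ascending on embedding perturbations."""
    delta_img = torch.zeros_like(img_emb, requires_grad=True)
    delta_txt = torch.zeros_like(txt_emb, requires_grad=True)
    for _ in range(adv_steps):
        logits = model(img_emb + delta_img, txt_emb + delta_txt)
        loss = F.cross_entropy(logits, labels) / adv_steps
        loss.backward()                    # grads for the deltas *and* the params
        with torch.no_grad():              # gradient ascent on the perturbations
            for d in (delta_img, delta_txt):
                d += adv_lr * d.grad.sign()
                d.clamp_(-eps, eps)
                d.grad.zero_()

model = ToyVLEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
img_emb = torch.randn(8, 10, 64)   # e.g. 10 region features per image
txt_emb = torch.randn(8, 12, 64)   # e.g. 12 token embeddings per sentence
labels = torch.randint(0, 2, (8,))

opt.zero_grad()
adversarial_step(model, img_emb, txt_emb, labels)
opt.step()  # one update: parameter grads were accumulated inside the inner loop
```

The design point the sketch tries to capture is that the inner loop updates only the perturbations while parameter gradients accumulate for free, so the adversarial objective adds little training cost beyond standard fine-tuning.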

Cited by 29 publications (43 citation statements)
References: 63 publications
“…Similarly, [48] and [284] It is also claimed by Salman et al. [286] that adversarially trained models, while less accurate than the standard models, often perform better for transfer learning. In another study, Gan et al. [287] propose VILLA, a representation learning approach based on large-scale adversarial training on vision-and-language data. They perform task-agnostic adversarial training followed by task-specific adversarial fine-tuning in the embedding space.…”
Section: A. Improving Model Performance
confidence: 99%
“…Recent years have seen rapid progress in vision-language pretraining (Uppal et al., 2020; Han et al., 2021; Khan et al., 2021). While a variety of approaches have been proposed, a large portion of them require object detection for image region feature regression or tagging as part of the pre-training objectives, for example LXMERT (Tan & Bansal, 2019), VLBERT (Su et al., 2020), VisualBERT (Li et al., 2019), UNITER (Chen et al., 2020b), Villa (Gan et al., 2020), Oscar, ERNIE-ViL (Yu et al., 2021), UNIMO, VinVL, VIVO, VL-T5 (Cho et al., 2021), etc. These methods rely on a strong object detection model like Fast(er) R-CNN (Ren et al., 2015), which is often trained on human-annotated datasets like Visual Genome (Krishna et al., 2016).…”
Section: Related Work
confidence: 99%
“…To examine the quality of vision-language pretraining, we first compare SimVLM on the popular multi-modal tasks described in Sec. 4.1.2 with state-of-the-art (SOTA) VLP methods including LXMERT (Tan & Bansal, 2019), VL-T5 (Cho et al., 2021), UNITER (Chen et al., 2020b), OSCAR, Villa (Gan et al., 2020), SOHO, UNIMO, and VinVL.…”
Section: Comparison With Existing Approaches
confidence: 99%
“…Multiple VQA datasets have been proposed, such as Visual Genome QA [25], VQA [2], GQA [16], CLEVR [22], MovieQA [53] and so on. Many works have shown state-of-the-art performance on VQA tasks, including task-specific VQA models with various cross-modality fusion mechanisms [13,20,24,49,62,66,67] and joint vision-language models that are pretrained on large-scale vision-language corpora and finetuned on VQA tasks [6,11,29,30,33,52,68]. Please note that the conventional VQA task does not require external knowledge by definition, although studies show some VQA questions may require commonsense knowledge to answer correctly [2].…”
Section: Related Work
confidence: 99%