2022
DOI: 10.48550/arxiv.2204.10496
Preprint
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

Cited by 3 publications (4 citation statements)
References 0 publications
“…As their model sizes grow rapidly, recent works have also explored parameter-efficient learning and model compression methods, including adapters (Sung, Cho, and Bansal 2022; Houlsby et al 2019; Rebuffi, Bilen, and Vedaldi 2018, 2017) and prompt tuning (Gu et al 2021b). […] (Wang et al 2020b; Li et al 2022c; Wang et al 2021; Ding et al 2023), distilling VL models is a relatively under-explored field, as pointed out by the review paper of (Chen et al 2022a). (Fang et al 2021) claim to be the first to distill vision-language Transformers, and MAD (Wang et al 2022b) claims to be the first to use multi-modal distillation for VL models. These two papers both focus on distilling Encoder-only VL Transformers.…”
Section: Related Work (mentioning)
confidence: 99%
“…This is exemplified by models directly learning the shallow mapping between prior question words and shared class labels in the absence of sample-specific contextualized candidate options. Consequently, models develop a false visual dependency (Cao et al, 2020; Wang et al, 2022b), as they may succeed in resolving VQA tasks (Selvaraju et al, 2016; Chen et al, 2020a; Gupta et al, 2022) by utilizing irrelevant visual cues.…”
Section: Related Work (mentioning)
confidence: 99%
“…For instance, training strategies like (Gupta et al, 2022; …) and DS solutions (Ray et al, 2019; Selvaraju et al, 2020; Ribeiro et al, 2019; Wang et al, 2022d) only focus on a single modality. Debiased training like (Wang et al, 2022b; Zhang et al, 2021b) requires either a specific model structure or a doubling of model complexity. Other methods (Chen et al, 2020a; Liang et al, 2020) apply occlusion boxes or masking to images or questions and thus drastically disturb the data distribution, leading to nonsensical synthesized answers.…”
Section: Related Work (mentioning)
confidence: 99%
“…Previous works (Song et al, 2022; Subramanian et al, 2022; Shen et al, 2021; Wang et al, 2022b) demonstrated that CLIP can achieve strong zero-shot performance on vision-language tasks by converting the original tasks into the image-text matching format. However, they mainly consider matching at an instance or global level, i.e., the whole image or sentence, ignoring the significance of fine-grained elements, e.g., keywords in the sentence and objects in the image.…”
Section: Introduction (mentioning)
confidence: 99%
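
The last statement above describes recasting vision-language tasks as CLIP-style image-text matching. As a rough illustration only (not taken from the cited papers or from MAD itself), the minimal Python sketch below shows zero-shot matching, assuming the Hugging Face transformers CLIP API; the image path and candidate texts are hypothetical placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch only: zero-shot image-text matching with CLIP.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
# A downstream task is recast as ranking candidate text prompts against the image.
candidates = [
    "a photo of a dog playing in a park",
    "a photo of a cat sleeping on a sofa",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarity scores; softmax converts them
# into a distribution over the candidate texts.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(candidates, probs[0].tolist())))

Matching at this whole-image, whole-sentence level is the instance- or global-level matching that the quoted statement contrasts with fine-grained (keyword- and object-level) matching.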