2021
DOI: 10.48550/arxiv.2112.05253
Preprint

MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Abstract: Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA, a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen [52], we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pre…
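The abstract describes conditioning a frozen autoregressive language model on visual input by mapping image features into the model's embedding space. Below is a minimal sketch of that image-prefix conditioning, assuming a generic image encoder and a frozen causal LM that accepts input embeddings directly; all module names and dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualPrefixLM(nn.Module):
    """Sketch: map image features to a short sequence of 'visual tokens'
    that is prepended to the text embeddings of a frozen causal LM."""

    def __init__(self, image_encoder, language_model,
                 img_dim=1024, lm_dim=4096, prefix_len=4):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a CLIP-style visual backbone
        self.language_model = language_model      # frozen autoregressive LM
        for p in self.language_model.parameters():
            p.requires_grad = False               # only newly added modules are trained
        # linear map from image-feature space into the LM's embedding space
        self.to_prefix = nn.Linear(img_dim, prefix_len * lm_dim)
        self.prefix_len, self.lm_dim = prefix_len, lm_dim

    def forward(self, images, text_embeds):
        # images: (B, C, H, W); text_embeds: (B, T, lm_dim)
        img_feats = self.image_encoder(images)                  # (B, img_dim)
        prefix = self.to_prefix(img_feats)                      # (B, prefix_len * lm_dim)
        prefix = prefix.view(-1, self.prefix_len, self.lm_dim)  # (B, prefix_len, lm_dim)
        inputs = torch.cat([prefix, text_embeds], dim=1)        # image conditions the text
        # assumes the LM can consume embeddings directly (an assumption of this sketch)
        return self.language_model(inputs_embeds=inputs)
```

Training such a model then reduces to standard next-token prediction on image-text pairs, with gradients flowing only through the new mapping (and, in MAGMA's case, through the adapters discussed in the citation statements below).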

Cited by 9 publications (15 citation statements)
References 22 publications
“…Several works followed this idea with architectural differences in the conditioning of the frozen language model. For instance, MAGMA (Eichenberg et al., 2021) adds bottleneck adapters (Houlsby et al., 2019) within the frozen language model; ClipCap (Mokady et al., 2021) proposes to use a vision-to-prefix transformer to map the vision features into a prefix instead of using a simple linear layer mapping. VC-GPT (Luo et al., 2022) moves away from the visual prefix tuning approach.…”
Section: Joint Vision and Language Modelling
mentioning; confidence: 99%
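The excerpt above contrasts MAGMA's bottleneck adapters with ClipCap's vision-to-prefix transformer. A Houlsby-style bottleneck adapter is a small residual module (down-projection, nonlinearity, up-projection) inserted into an otherwise frozen transformer block; here is a minimal sketch with illustrative dimensions, not the exact MAGMA configuration.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style adapter: down-project, nonlinearity, up-project, residual.
    Only these few parameters are trained; the surrounding block stays frozen."""

    def __init__(self, hidden_dim=4096, bottleneck_dim=256):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)  # start near identity so the frozen LM is undisturbed
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

In adapter-based finetuning, one such module is typically inserted after the attention and/or feed-forward sublayers of every layer of the frozen language model, so only a small fraction of the total parameters is updated.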
“…To overcome these challenges, Prismer leverages powerful pre-trained domain expert models for data-efficient training. Unlike another set of works that prioritise in-context capability by conditioning on a large frozen language model with no task-specific fine-tuning [26,77,3], Prismer focuses on fine-tuned performance with an emphasis on parameter efficiency, using smaller but diverse pre-trained models.…”
Section: Related Work
mentioning; confidence: 99%
“…To this end, we propose to generate text using a caption-generation model. Specifically, we used MAGMA (Multimodal Augmentation of Generative Models) [15]. MAGMA is a recent text generation model based on multimodal few-shot learners [47].…”
Section: Dataset Documentation Of Q16
mentioning; confidence: 99%
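This excerpt describes using MAGMA as a caption generator to document image datasets. The following is a hedged usage sketch of that idea; `document_images` and `model.generate` are hypothetical placeholders for whatever interface the caption model exposes, not MAGMA's documented API.

```python
from PIL import Image

def document_images(model, image_paths, prompt="A picture of"):
    """Hypothetical helper: caption each image with a multimodal model so the
    generated text can be attached to the dataset's documentation."""
    captions = {}
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        # `model.generate` stands in for the caption model's actual interface
        # (an assumption of this sketch, not MAGMA's documented API).
        captions[path] = model.generate([image, prompt], max_new_tokens=30)
    return captions
```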
“…Additionally, Q16 employs the recent autoregressive caption generation model MAGMA [15] to provide accessible documentation. Thus, Q16 assists dataset documentation and curation by answering Question 16 of [17], which also explains its name: Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?…”
mentioning; confidence: 99%