MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Eichenberg, Constantin; Black, Sidney; Weinbach, Samuel; Pârcălăbescu, Letiția; Frank, Anette

doi:10.48550/arxiv.2112.05253

Cited by 9 publications

(15 citation statements)

References 22 publications

“…Several works followed this idea with architectural differences in the conditioning of the frozen language model. For instance, MAGMA (Eichenberg et al, 2021) adds bottleneck adapters (Houlsby et al, 2019) within the frozen language model; ClipCap (Mokady et al, 2021) proposes to use a vision-to-prefix transformer to map the vision features into a prefix instead of using a simple linear layer mapping. VC-GPT (Luo et al, 2022) moves away from the visual prefix tuning approach.…”

Section: Joint Vision and Language Modelling

mentioning

confidence: 99%

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac, Donahue, Luc et al. 2022

Preprint

Building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering. For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.

“…To overcome these challenges, Prismer leverages powerful pre-trained domain expert models for data-efficient training. Unlike another set of works that prioritise in-context capability by conditioning on a large frozen language model with no task-specific fine-tuning [26,77,3], Prismer focuses on fine-tuned performance with an emphasis on parameter efficiency, using smaller but diverse pre-trained models.…”

Section: Related Work

mentioning

confidence: 99%

Prismer: A Vision-Language Model with An Ensemble of Experts

Liu, Fan, Johns et al. 2023

Preprint

Recent vision-language models have shown impressive multi-modal generation capabilities. However, typically they require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of domain experts. Prismer only requires training of a small number of components, with the majority of network weights inherited from readily-available, pre-trained domain experts, and kept frozen during training. By leveraging experts from a wide range of domains, we show that Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with current state-of-the-art models, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.

“…To this end, we propose to generate text using a caption-generation model. Specifically, we used MAGMA (Multimodal Augmentation of Generative Models) [15]. MAGMA is a recent text generation model based on multimodal few-shot learners [47].…”

Section: Dataset Documentation Of Q16

mentioning

confidence: 99%

“…Additionally, Q16 employs the recent autoregressive caption generation model MAGMA [15] to provide accessible documentation. Thus, Q16 assists dataset documentation and curation by answering Question 16 of [17], which also explains its name: Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?…”

mentioning

confidence: 99%

Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content?

Schramowski, Tauchmann, Kersting 2022

Preprint

This paper contains images and descriptions that are offensive in nature. Large datasets underlying much of current machine learning raise serious issues concerning inappropriate content such as offensive, insulting, threatening, or might otherwise cause anxiety. This calls for increased dataset documentation, e.g., using datasheets. They, among other topics, encourage to reflect on the composition of the datasets. So far, this documentation, however, is done manually and therefore can be tedious and error-prone, especially for large image datasets. Here we ask the arguably "circular" question of whether a machine can help us reflect on inappropriate content, answering Question 16 in Datasheets. To this end, we propose to use the information stored in pre-trained transformer models to assist us in the documentation process. Specifically, prompt-tuning based on a dataset of socio-moral values steers CLIP to identify potentially inappropriate content, therefore reducing human labor. We then document the inappropriate images found using word clouds, based on captions generated using a vision-language model. The documentations of two popular, large-scale computer vision datasets (ImageNet and OpenImages) produced this way suggest that machines can indeed help dataset creators to answer Question 16 on inappropriate image content.
