Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA, a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen [52], we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for the transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state-of-the-art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on ∼0.2% of the number of samples used to train SimVLM [55].
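The adapter-based setup described above can be sketched in a few lines of PyTorch. The snippet below is an illustrative toy, not the authors' code: a frozen GPT-2 stands in for the pretrained language model, image features (in MAGMA these come from a CLIP-style visual backbone; here they are random placeholders) are projected to a short sequence of prefix embeddings, bottleneck adapters are inserted after each transformer MLP, and everything is trained with the ordinary next-token loss. All module names, dimensions, and the prefix length are assumptions made for the sketch.

```python
# Hypothetical sketch of MAGMA-style adapter finetuning (illustrative, not the
# authors' code): frozen GPT-2 + trainable adapters + trainable image prefix.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down, self.up, self.act = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim), nn.GELU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))

class ImagePrefix(nn.Module):
    """Maps pooled visual features to a short sequence of LM prefix embeddings."""
    def __init__(self, feat_dim: int, lm_dim: int, prefix_len: int = 4):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.proj = nn.Linear(feat_dim, prefix_len * lm_dim)

    def forward(self, feats):                       # feats: (B, feat_dim)
        return self.proj(feats).view(-1, self.prefix_len, self.lm_dim)

lm = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
for p in lm.parameters():                           # language model weights stay frozen
    p.requires_grad = False

dim = lm.config.n_embd
for block in lm.transformer.h:                      # insert an adapter after each MLP
    block.mlp = nn.Sequential(block.mlp, Adapter(dim))

prefix = ImagePrefix(feat_dim=512, lm_dim=dim)      # 512 stands in for the visual encoder's output dim

# Toy forward/backward pass with random "image features" and a caption.
feats = torch.randn(1, 512)
ids = tok("A photo of a dog playing fetch.", return_tensors="pt").input_ids
text_emb = lm.transformer.wte(ids)                  # token embeddings of the caption
inputs = torch.cat([prefix(feats), text_emb], dim=1)
labels = torch.cat([torch.full((1, prefix.prefix_len), -100), ids], dim=1)  # no loss on image positions

trainable = [p for p in lm.parameters() if p.requires_grad] + list(prefix.parameters())
optimizer = torch.optim.AdamW(trainable, lr=2e-4)   # only adapters + prefix are optimized

loss = lm(inputs_embeds=inputs, labels=labels).loss # single language-modelling objective
loss.backward()
optimizer.step()
```

Because the adapters and the image prefix are the only parameters passed to the optimizer, gradients never touch the language model weights, mirroring the claim that linguistic knowledge from pretraining is preserved.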
We consider a class of mass transfer models on a one-dimensional lattice with nearest-neighbour interactions. The evolution is given by the backward parabolic equation ∂_t x = −(β/|β|) Δx^β, with β in the fast diffusion regime (−∞, 0) ∪ (0, 1]. Sites with mass zero are deleted from the system, which leads to a coarsening of the mass distribution. The rate of coarsening suggested by scaling is t^{1/(1−β)}.
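To make the dynamics concrete, here is a small numerical sketch (a toy implementation written for illustration, not taken from the paper): masses on a periodic lattice are evolved by explicit Euler steps of the backward parabolic equation dx_i/dt = −sgn(β)(x_{i+1}^β − 2x_i^β + x_{i−1}^β), and any site whose mass drops to zero is removed so that its former neighbours become adjacent. The time step, removal threshold, periodic boundary conditions, and initial data are all assumptions of the sketch.

```python
# Toy simulation of the lattice coarsening model (illustrative only):
# explicit Euler steps of dx_i/dt = -sgn(beta) * (x_{i+1}^beta - 2 x_i^beta + x_{i-1}^beta),
# with zero-mass sites deleted so the remaining sites become nearest neighbours.
import numpy as np

def coarsen(x, beta, dt=1e-4, steps=200_000, tol=1e-8):
    """Evolve masses x under the backward parabolic lattice equation,
    deleting (near-)zero-mass sites, and return the surviving masses."""
    x = np.asarray(x, dtype=float)
    sign = np.sign(beta)
    for _ in range(steps):
        xb = x ** beta                                   # nearest-neighbour coupling through x^beta
        lap = np.roll(xb, -1) - 2 * xb + np.roll(xb, 1)  # discrete Laplacian, periodic boundary
        x = x - dt * sign * lap                          # backward parabolic step
        x = x[x > tol]                                   # delete sites whose mass has vanished
        if x.size <= 1:
            break
    return x

rng = np.random.default_rng(0)
x0 = rng.uniform(0.5, 1.5, size=200)                     # random initial masses on 200 sites
xT = coarsen(x0, beta=0.5)
print(f"{x0.size} sites -> {xT.size} sites, mean surviving mass {xT.mean():.3f}")
```

Running the sketch, the number of occupied sites decreases while the characteristic mass of the survivors grows, which is the coarsening behaviour the abstract describes.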