“…The ability to understand and generate natural language from visual information is a critical component of many real-world applications, including visual question answering (VQA), visual reasoning, and multimodal information retrieval. In recent years, the success of deep learning in natural language processing (NLP) has led to the development of large-scale vision-language models (VLMs) (Tan and Bansal, 2019; Li et al., 2021b; Kim et al., 2021a; Alayrac et al., 2022; Wang et al., 2022c; Shen et al., 2022b; Li et al., 2021a; Shen et al., 2022a; Jia et al., 2021) that leverage powerful neural network architectures to encode and decode multimodal information. However, state-of-the-art vision-language models like Flamingo-80B (Alayrac et al., 2022), BEIT-3-1.9B, and PaLI-17B can be computationally expensive and difficult to train, which has motivated researchers to explore ways of improving their efficiency and effectiveness.…”