Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Li, Dongxu; Li, Junnan; Li, Hongdong; Niebles, Juan Carlos; Hoi, Steven C. H.

doi:10.1109/cvpr52688.2022.00490

Cited by 338 publications

(544 citation statements)

References 29 publications

Supporting

Mentioning

542

Contrasting

Unclassified


“…Vision-language pretraining (VLP) aims to learn joint visual-textual representations for a variety of multimodal downstream tasks. Existing works either learn unimodal encoders by distinguishing the positive pair(s) from the unpaired samples [3,28,41] or focus on one multimodal encoder for joint feature learning with masked image/language modeling and image-text matching losses [12,27,29]. Additionally, some approaches seek fine-grained supervision for cross-modal interaction [20,30,54,55,56].…”

Section: Related Work (mentioning)

confidence: 99%
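The first family of methods in the statement above (unimodal encoders trained to tell paired samples from unpaired ones) is commonly trained with a symmetric contrastive loss. The following is a minimal sketch of that idea, not code from any of the cited works; the function name `info_nce`, the temperature value, and the toy similarity matrices are assumptions made for illustration.

```python
import math


def info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between visual sample i and text sample j.
    The diagonal entries are the paired (positive) samples; every other
    entry in the same row or column is treated as an unpaired negative.
    The name and the default temperature are illustrative assumptions.
    """
    n = len(sim)

    def mean_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]  # negative log-probability of the positive
        return total / n

    columns = [list(c) for c in zip(*sim)]  # text-to-visual direction
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(columns))


# Toy similarity matrices (hypothetical values).
aligned = [[1.0, 0.0], [0.0, 1.0]]     # positives score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # negatives score highest
print(info_nce(aligned) < info_nce(misaligned))
```

The loss is small when each sample is most similar to its own pair and large when an unpaired sample scores higher.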

Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

Xu, Hou, Zhang et al. 2023

Preprint


In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories. The main contributions are as follows: First, we propose a transformer-based model for OVS, termed OVSegmentor, which only exploits web-crawled image-text pairs for pre-training without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, which encourages the model to learn visual invariance. Third, we construct the CC4M dataset for pre-training by filtering CC12M with frequently appearing entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on three benchmark datasets, PASCAL VOC 2012, PASCAL Context, and COCO Object. Our model achieves superior segmentation results over the state-of-the-art method by using only 3% data (4M vs 134M) for pre-training. Code and pre-trained models will be released for future research.



“…LAVIS currently supports 4 foundation models, i.e., ALBEF [34], BLIP [33], CLIP [44] and ALPRO [32].…”

Section: Supported Tasks, Datasets and Models (mentioning)

confidence: 99%

“…(iii) State-of-the-art and reproducible language-vision models. The library enables access to over 30 pre-trained and task-specific fine-tuned model checkpoints of four foundation models: ALBEF [34], BLIP [33], CLIP [44] and ALPRO [32]. These models achieve competitive performance across multiple tasks evaluated using common metrics.…”

Section: Introduction (mentioning)

confidence: 99%

LAVIS: A Library for Language-Vision Intelligence

Li, Li, Lê et al. 2022

Preprint

Self Cite


We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that makes recent advancements in the language-vision field accessible to researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language and video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue and pre-training. At the same time, the library is highly extensible and configurable, facilitating future development and customization. In this technical report, we describe design principles, key components and functionalities of the library, and also present benchmarking results across common language-vision tasks. The library is available at: https://github.com/salesforce/LAVIS.


“…To alleviate the temporal misalignment issue, it incorporates an entropy minimization-based constrained attention loss to encourage the model to automatically focus on the correct captions from a pool of candidate ASR captions. Li et al. (2022d) propose a new visually-grounded pre-training task, prompting entity modeling (PEM), to learn fine-grained region-entity alignment. The prediction targets for the PEM task are generated by an entity prompter module, trained with contrastive learning to produce the similarity between a video crop and text prompts instantiated with entity names.…”

Section: Advanced Pre-training Tasks (mentioning)

confidence: 99%
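The statement above describes the entity prompter as scoring a video crop against text prompts instantiated with entity names, and using those scores as prediction targets for PEM. A rough sketch of how such soft targets could be derived follows. It is not the authors' implementation: the function names, the temperature, the prompt template in the comment, and the toy embeddings are all assumptions, and real embeddings would come from trained video and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def entity_pseudo_labels(crop_embedding, prompt_embeddings, temperature=0.07):
    """Turn crop-to-prompt similarities into a soft label distribution.

    prompt_embeddings maps an entity name to the embedding of a prompt
    built from it (for example a hypothetical template such as
    "A video of a {entity}"). Returns a dict of probabilities that sum to 1.
    """
    sims = {name: cosine(crop_embedding, emb)
            for name, emb in prompt_embeddings.items()}
    m = max(sims.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - m) / temperature) for name, s in sims.items()}
    z = sum(exps.values())
    return {name: e / z for name, e in exps.items()}


# Toy 2-D embeddings (hypothetical values standing in for encoder outputs).
crop = [1.0, 0.0]
prompts = {"dog": [0.9, 0.1], "car": [0.0, 1.0]}
labels = entity_pseudo_labels(crop, prompts)
print(max(labels, key=labels.get))
```

In this sketch the softmax over similarities plays the role of the soft target; the entity whose prompt embedding lies closest to the crop embedding receives most of the probability mass.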

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

Gan, Fu, Li et al. 2022

Preprint


This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: (i) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; (ii) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and (iii) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.

Zhe Gan and Jianfeng Gao initiated the project. Zhe Gan and Linjie Li took the lead in the writing of Chapter 1. Linjie Li and Jianfeng Gao took the lead in the writing of Chapter 2. Zhe Gan further took the lead in the writing of Chapters 3 and 7. Chunyuan Li took the lead in the writing of Chapter 4. Linjie Li further took the lead in the writing of Chapter 5. Lijuan Wang and Zicheng Liu took the lead in the writing of Chapter 6. All the authors provided project advice, and contributed to paper editing and proofreading.

