2021
DOI: 10.48550/arxiv.2105.09996
Preprint

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Cited by 12 publications (19 citation statements)
References 25 publications

“…35.4 [145] (6K) 138.7 [142] (10K) 36.7 [138] (46K) 75.2 [87] (123K) 54.7 [147] (20K) 25.2 [139] (38K) 75.4 [60] (9K) -…” (fragment of a results table)
Section: Training Details for the Flamingo Models
Mentioning; confidence: 99%
“…Our pre-trained model achieves higher performance at lower computation cost. Finally, some work [27,28,30,50,55] adopts a joint encoder that takes concatenated videos and texts as input, so every text-video pair must be fed through the encoder during inference, resulting in low retrieval efficiency. By comparison, our model adopts the efficient "dual-encoder" architecture, with only a video encoder and a text encoder at inference.…”
Section: Methods
Mentioning; confidence: 99%
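
The contrast this excerpt draws is architectural: with a dual encoder, each modality is embedded independently, so candidate video embeddings can be precomputed offline and retrieval reduces to a similarity lookup. A minimal PyTorch-style sketch of that pattern follows; the projection heads, feature dimensions, and names are illustrative assumptions, not the cited model's implementation.

```python
# A minimal sketch of a dual-encoder retrieval setup, assuming simple
# linear projection heads over precomputed features; all layer names
# and dimensions are illustrative, not the cited model's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, joint_dim)    # text branch

    def forward(self, video_feats, text_feats):
        # Each modality is encoded independently, so video embeddings
        # can be computed once offline; retrieval is then one matrix
        # multiply rather than one encoder pass per text-video pair.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return t @ v.T  # [num_texts, num_videos] cosine similarities

# Usage: scores[i, j] ranks video j for text query i.
scores = DualEncoder()(torch.randn(100, 2048), torch.randn(8, 768))
```

Because nothing couples the two branches before the final similarity, scoring N texts against M videos costs N + M encoder passes rather than N × M.
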
“…Although these methods are efficient for video-text retrieval, they ignore local semantics and fine-grained alignment between modalities. Methods in the second category [27,28,30,47,50,55] adopt "joint-encoder" architectures that model interactions between cross-modal local features by concatenating videos and texts as input, with a binary classifier predicting whether a video and a text are aligned. Although they can build local associations between videos and texts, they sacrifice retrieval efficiency, since every text-video pair must be fed through the encoder during inference.…”
Section: Related Work
Mentioning; confidence: 99%
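
For contrast with the dual encoder above, here is a minimal sketch of the joint-encoder pattern this excerpt describes: video and text tokens are concatenated, a transformer attends across both modalities, and a binary head scores alignment. All layer sizes and names are illustrative assumptions.

```python
# A minimal sketch of the joint-encoder pattern: video and text tokens
# are concatenated and a binary head scores alignment. All sizes and
# names here are illustrative assumptions, not a specific cited model.
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.align_head = nn.Linear(dim, 1)  # aligned / not-aligned logit

    def forward(self, video_tokens, text_tokens):
        # Self-attention over the concatenated sequence lets every text
        # token attend to every video token (fine-grained alignment),
        # but each candidate pair must be re-encoded at inference time.
        x = torch.cat([video_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        return self.align_head(x.mean(dim=1))  # [batch, 1] alignment score

# Usage: one forward pass per (video, text) candidate pair.
logit = JointEncoder()(torch.randn(1, 32, 512), torch.randn(1, 12, 512))
```

The cross-modal attention is what buys the fine-grained alignment, and it is also why every candidate pair needs its own forward pass at inference, the efficiency cost both excerpts note.
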
“…An instructional or how-to video contains a human subject demonstrating and narrating how to accomplish a certain task. Early work on HowTo100M has focused on leveraging this large collection to learn models that can be transferred to other tasks, such as action recognition [4,37,38], video captioning [24,36,66], or text-video retrieval [7,37,61]. The problem of recognizing the task performed in an instructional video has been considered by Bertasius et al. [8].…”
Section: Related Work
Mentioning; confidence: 99%