FLAVA: A Foundational Language And Vision Alignment Model
Preprint, 2021
DOI: 10.48550/arxiv.2112.04482

Cited by 21 publications (28 citation statements)
References 0 publications
“…DeCLIP [26] utilized more image-text pairs collected from CLIP [30] by adding multiple self-supervised techniques. Inspired by BERT, other methods study cross-modal matching loss [8,28,34,35,44]. FLAVA [34] employs both contrastive and multimodal training objectives on paired and image-only datasets.…”
Section: Related Work
confidence: 99%
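The combination the excerpt describes, an image-text contrastive loss plus a cross-modal matching objective on paired data, can be sketched as follows. This is an illustrative approximation, not FLAVA's released code; the function names, weighting, and the omission of the unimodal masked-modeling terms are my own simplifications.

```python
# Illustrative sketch of a contrastive + cross-modal matching objective.
# All names here are hypothetical placeholders, not the authors' API.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE over a batch of paired embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def matching_loss(match_logits, is_matched):
    """Binary image-text matching loss over (mis)matched pairs."""
    return F.binary_cross_entropy_with_logits(match_logits, is_matched.float())

def paired_batch_loss(image_emb, text_emb, match_logits, is_matched, w_match=1.0):
    # Paired batches contribute both terms; image-only or text-only batches
    # would contribute only their own masked-modeling losses (omitted here).
    return contrastive_loss(image_emb, text_emb) + w_match * matching_loss(match_logits, is_matched)
```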
“…[7] and [56] propose a simple framework that can process information from multiple modalities with a uniform byte-sequence representation. [57] and [58] unify tasks of different modalities by designing various task-specific layers. [59] explores to employ a retrieval-based unified paradigm.…”
Section: Related Work
confidence: 99%
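The uniform byte-sequence idea mentioned above can be illustrated with a minimal sketch: every modality is serialized into the same integer vocabulary before being fed to a single model. This is only an illustration under my own assumptions; the cited frameworks define their own encodings and tokenizers.

```python
# Minimal sketch: serialize text and raw image bytes into one flat byte sequence.

def to_byte_sequence(sample: dict) -> list[int]:
    """Flatten the available modalities of a sample into byte values (0-255)."""
    tokens: list[int] = []
    if "text" in sample:
        tokens.extend(sample["text"].encode("utf-8"))   # text as UTF-8 bytes
    if "image_bytes" in sample:                         # e.g. raw PNG/JPEG bytes
        tokens.extend(sample["image_bytes"])
    return tokens

# A single model can then consume the same vocabulary regardless of modality:
seq = to_byte_sequence({"text": "a photo of a dog", "image_bytes": b"\x89PNG"})
```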
“…SLIP [31] learned a joint representation by leveraging both a paired dataset, and a much larger image-only dataset, using self-supervised techniques. FLAVA [40] employs both contrastive and multi-modal training objectives on paired and image-only datasets. The joint representation was shown to hold a strong semantic alignment between the two modalities, enabling image generation [29,47], image manipulation [2,34], and image captioning [30].…”
Section: Retrieval For Generation
confidence: 99%
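Retrieval in such a joint embedding space reduces to nearest-neighbor search by cosine similarity, as sketched below. The embeddings are assumed to come from any contrastively trained dual encoder (CLIP, SLIP, FLAVA, or similar); the function and argument names are placeholders of mine.

```python
# Hedged sketch of cross-modal retrieval in a shared image-text embedding space.
import torch
import torch.nn.functional as F

def retrieve_images(text_emb: torch.Tensor, image_embs: torch.Tensor, k: int = 5):
    """Return indices of the k images closest to the text query by cosine similarity."""
    text_emb = F.normalize(text_emb, dim=-1)        # (D,)
    image_embs = F.normalize(image_embs, dim=-1)    # (N, D)
    scores = image_embs @ text_emb                  # (N,) cosine similarities
    return torch.topk(scores, k=k).indices
```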
“…For Photo-Realistic experiments we use a modified version of the Public Multimodal Dataset (PMD) used by FLAVA [40]. The modified PMD dataset is composed from the following set of publicly available Text-Image datasets: SBU Captions [33], Localized Narratives [35], Conceptual Captions [38], Visual Genome [25], Wikipedia Image Text [42], Conceptual Captions 12M [7], Red Caps [9], and, a filtered version of YFCC100M [43].…”
Section: Datasets
confidence: 99%
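Assembling a PMD-like collection as described in the excerpt amounts to normalizing several public image-text sources into a common (image, caption) format and concatenating them, with filtering applied where needed (e.g. for YFCC100M). The sketch below is a rough illustration; the loader functions are hypothetical placeholders, and each real dataset (SBU Captions, Conceptual Captions, RedCaps, and so on) has its own format and license terms.

```python
# Rough sketch of composing a PMD-like paired dataset from multiple sources.
from typing import Callable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (image path or URL, caption)

def build_pmd_like(sources: List[Callable[[], Iterator[Pair]]]) -> Iterator[Pair]:
    """Yield normalized (image, caption) pairs from every source in turn."""
    for load in sources:
        for image, caption in load():
            if caption and caption.strip():   # simple filtering hook
                yield image, caption.strip()

# Example wiring with placeholder loaders (not real library functions):
# pairs = build_pmd_like([load_sbu_captions, load_conceptual_captions, load_redcaps])
```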