2020
DOI: 10.48550/arxiv.2006.06666
Preprint

VirTex: Learning Visual Representations from Textual Annotations

Abstract: The de facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex, a pretraining approach using semantically dense captions…
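
The abstract describes pretraining a visual backbone jointly with a textual head that generates captions, then transferring the backbone to downstream vision tasks. Below is a minimal sketch of that setup in PyTorch; the module names, layer sizes, vocabulary, and training loop are hypothetical stand-ins, not the authors' released implementation.

```python
# Minimal sketch of captioning-based visual pretraining in the spirit of VirTex.
# All names and hyperparameters are hypothetical; positional encodings omitted.
import torch
import torch.nn as nn
import torchvision

class CaptioningPretrainer(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, pad_idx=0):
        super().__init__()
        # Visual backbone whose features we ultimately want to transfer.
        backbone = torchvision.models.resnet50(weights=None)
        self.visual = nn.Sequential(*list(backbone.children())[:-2])  # keep the spatial grid
        self.project = nn.Conv2d(2048, d_model, kernel_size=1)
        # Textual head: a small Transformer decoder that predicts caption tokens.
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids
        feats = self.project(self.visual(images))         # (B, d, h, w)
        memory = feats.flatten(2).transpose(1, 2)         # (B, h*w, d) visual "memory"
        tgt = self.embed(captions)                        # (B, T, d)
        T = captions.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)  # tokens attend to image features
        return self.lm_head(out)                          # (B, T, vocab) next-token logits

model = CaptioningPretrainer()
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(1, 10000, (2, 12))
logits = model(images, captions[:, :-1])                  # teacher forcing
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
loss.backward()
# After pretraining, the caption head is discarded and `model.visual`
# provides the transferable visual representation.
```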

Cited by 29 publications (50 citation statements)
References 79 publications (157 reference statements)

“…[28] shows that pretraining by predicting hashtags on Instagram improves performance on ImageNet by over 5%. [8,35,44] all demonstrate the effectiveness of transformer-based language modeling in learning image representations from text. CLIP [32] and ALIGN [21] apply natural language supervision to the domain of zero-shot learning (ZSL).…”
Section: Introduction (mentioning)
confidence: 88%
“…In addition, (Joulin et al., 2016; Li et al., 2017; Desai & Johnson, 2020; Sariyildiz et al., 2020) demonstrate that good visual representations can be learned by predicting image captions. To scale up vision-language joint training, CLIP (Radford et al., 2021) and ALIGN both collect their own image-text datasets, with 400M and 1B image-caption pairs respectively.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recently, natural language has been used as a powerful source of supervision for visual representation learning. (Desai & Johnson, 2020; Sariyildiz et al., 2020) demonstrate the effectiveness of pretraining on image-text data. Among them, CLIP (Radford et al., 2021) applies natural language supervision to zero-shot image recognition.…”
Section: Introduction (mentioning)
confidence: 99%
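
Several of the statements above describe CLIP-style natural language supervision, where matched image-caption pairs are pulled together in a shared embedding space and zero-shot recognition scores an image against embedded class prompts. The sketch below shows the symmetric contrastive (InfoNCE) objective such models are commonly trained with; the embedding sizes, batch size, and temperature are illustrative assumptions, not values from any cited paper.

```python
# Sketch of a symmetric image-text contrastive objective (CLIP-style supervision).
# Encoder outputs are faked with random tensors; all sizes are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-caption pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0))            # i-th image matches i-th caption
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

image_emb = torch.randn(8, 512)   # stand-in for an image-encoder output
text_emb = torch.randn(8, 512)    # stand-in for a text-encoder output
print(contrastive_loss(image_emb, text_emb).item())
# Zero-shot recognition reuses the same similarity: embed prompts such as
# "a photo of a {label}" and pick the class with the highest score.
```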
“…There is an alternative approach that directly computes the similarity score without having modality-wise representations. Typical examples are the cross-modal attention models [11,22,30,32] (details in Sec. 4).…”
Section: Problem Setup and Background (mentioning)
confidence: 99%
“…The main benefit of the former is its computational efficiency, scalable to billions of instances at training/test time thanks to the efficient dot product. The latter directly computes the similarity score without modality-wise representations [11,22,30,32], using transformer-like attentive neural networks that capture interactions between local features of instances from different modalities. Although they can capture such cross-modal interactions, these models are computationally demanding and very slow due to the quadratic complexity in the number of local features.…”
Section: Related Work (mentioning)
confidence: 99%
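
The last two statements contrast the two scoring schemes: a dual encoder pools each modality into a single vector and scores a pair with one dot product, while cross-modal attention mixes local features across modalities before scoring. The sketch below illustrates why the former scales to billions of pairs (one O(d) dot product each) while the latter is quadratic in the number of local features (an Nt x Nv attention map per pair); the shapes, pooling, and modules are illustrative, not taken from any cited model.

```python
# Illustrative comparison of dual-encoder vs cross-modal-attention scoring.
import torch
import torch.nn as nn

B, Nv, Nt, d = 4, 49, 16, 256        # batch, image regions, text tokens, feature dim
img_feats = torch.randn(B, Nv, d)    # local visual features (e.g. a 7x7 grid)
txt_feats = torch.randn(B, Nt, d)    # local text features (token embeddings)

# (a) Dual encoder: pool each modality into one vector, score with a dot product.
#     Each pair costs O(d), so scores can be indexed and searched at scale.
img_vec = img_feats.mean(dim=1)
txt_vec = txt_feats.mean(dim=1)
dual_scores = (img_vec * txt_vec).sum(dim=-1)         # (B,) matched-pair scores

# (b) Cross-modal attention: text tokens attend over image regions before scoring,
#     which builds an Nt x Nv attention map, i.e. quadratic in local features.
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
attended, _ = cross_attn(query=txt_feats, key=img_feats, value=img_feats)
cross_scores = attended.mean(dim=(1, 2))              # crude pooled score, (B,)

print(dual_scores.shape, cross_scores.shape)
```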