2022
DOI: 10.1007/978-3-031-19809-0_30
SLIP: Self-supervision Meets Language-Image Pre-training

Cited by 128 publications (65 citation statements).
References 23 publications.
“…The core technique of CLIP is aligning both vision and language modalities in a joint embedding space by contrasting global representations. Follow-up works further improve CLIP from the vision-only [43,67] or vision-language [38,64,68,69] side. In this paper, we bring natural language supervision together with masked image modeling for better visual pre-training on these two paradigms.…”
Section: Related Work (mentioning)
confidence: 99%
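The alignment described in this excerpt can be illustrated with a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired global image and text embeddings; the function name, dimensions, and random tensors below are illustrative assumptions, not the actual CLIP or SLIP implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Hypothetical sketch: matching image-caption pairs sit on the diagonal of
    # the similarity matrix and are pulled together; all other pairs are pushed apart.

    # L2-normalize both modalities so the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits between every image and every caption in the batch.
    logits = image_features @ text_features.t() / temperature

    # Contrast in both directions: image-to-text and text-to-image.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
img = torch.randn(8, 512)   # batch of global image embeddings
txt = torch.randn(8, 512)   # batch of caption embeddings
loss = clip_contrastive_loss(img, txt)
```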
“…We evaluate zero-shot classification over 21 benchmarks including ImageNet-1K [15]. Details of each dataset are listed in the appendix, and the evaluation recipes (e.g., prompt engineering) strictly follow [43]. Results are shown in Table 6.…”
Section: Zero-shot Transfer (mentioning)
confidence: 99%
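The prompt-engineering recipe mentioned in this excerpt can be sketched roughly as follows: each class name is wrapped in one or more prompt templates, encoded, and averaged into a classifier weight, and images are assigned to the most similar class embedding. The helper names (`encode_text`, `encode_image`) and the toy encoders are hypothetical placeholders, not the evaluated model's API.

```python
import torch
import torch.nn.functional as F

def build_zero_shot_classifier(encode_text, class_names, templates):
    # One normalized weight vector per class, averaged over prompt templates.
    weights = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]        # e.g. "a photo of a cat."
        emb = F.normalize(encode_text(prompts), dim=-1)       # (num_templates, dim)
        weights.append(F.normalize(emb.mean(dim=0), dim=-1))  # average, then renormalize
    return torch.stack(weights)                               # (num_classes, dim)

def zero_shot_predict(encode_image, images, classifier_weights):
    img_emb = F.normalize(encode_image(images), dim=-1)       # (batch, dim)
    logits = img_emb @ classifier_weights.t()                 # cosine similarities
    return logits.argmax(dim=-1)                              # predicted class indices

# Toy usage with random "encoders" standing in for a real image-text model.
dim = 512
encode_text = lambda prompts: torch.randn(len(prompts), dim)
encode_image = lambda imgs: torch.randn(imgs.shape[0], dim)
weights = build_zero_shot_classifier(encode_text, ["cat", "dog"], ["a photo of a {}."])
preds = zero_shot_predict(encode_image, torch.zeros(4, 3, 224, 224), weights)
```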
“…It is relatively easy to collect a large-scale dataset of image-text paired data from the web [34,35]; many works have studied enhancing the learned visual representations of a model with web image-text datasets [26,27,30,32,54,55]. CLIP [32] tackles this challenge by performing contrastive learning over the paired image and text data.…”
Section: Vision Learners With Web Image-text Pairs (mentioning)
confidence: 99%
“…Differing from the above single-modal SSL methods, pioneering multi-modal SSL methods (CLIP [32] and ALIGN [24]) have shown great scalability, i.e., easily collected large-scale web image-text pairs [34,35,38] can contribute to strong transfer learning performance. Follow-up works also showed that the textual information in web image-text pairs can greatly benefit various downstream tasks [26,27,30,54]. Under fair conditions, with the same web image-text dataset and training schedule, we first conduct a benchmark study of single-modal SSL methods (with only images) and multi-modal SSL methods (with image-text pairs).…”
Section: Introduction (mentioning)
confidence: 99%