2022
DOI: 10.48550/arxiv.2205.14459
Preprint
CyCLIP: Cyclic Contrastive Language-Image Pretraining

Cited by 5 publications (5 citation statements)
References 0 publications

“…PyramidCLIP (Gao et al. 2022) and FILIP (Yao et al. 2021) introduce finer-grained interactions between the two modalities, seeking more accurate cross-modal alignment. CyCLIP (Goel et al. 2022) points out the importance of geometric consistency in the learned representation space between the two modalities and proposes geometric consistency constraints. Different from dual-stream models, single-stream models such as VisualBERT (Li et al. 2019) and OSCAR (Li et al. 2020) fuse image and text features with a unified model to achieve deeper interaction.…”
Section: Related Work (Vision-Language Pre-training)
confidence: 99%
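
The "geometric consistency constraints" referenced above are CyCLIP's two cyclic losses: a cross-modal term that pushes the image-text similarity matrix toward symmetry (sim(I_i, T_j) ≈ sim(I_j, T_i)) and an in-modal term that matches image-image similarities to the corresponding text-text similarities. Below is a minimal PyTorch sketch under those definitions; the function name, tensor shapes, and the assumption of L2-normalized embeddings are illustrative, not taken from the paper's released code.

```python
import torch

def cyclic_consistency_losses(image_emb: torch.Tensor, text_emb: torch.Tensor):
    """Sketch of CyCLIP-style cyclic consistency terms.

    Assumes image_emb and text_emb are L2-normalized, shape (batch, dim),
    with row i of each tensor forming a matched image-text pair.
    """
    sim_it = image_emb @ text_emb.t()   # sim(I_i, T_j)
    sim_ii = image_emb @ image_emb.t()  # sim(I_i, I_j)
    sim_tt = text_emb @ text_emb.t()    # sim(T_i, T_j)

    # Cross-modal consistency: the image-text similarity matrix
    # should be symmetric, i.e. sim(I_i, T_j) ≈ sim(I_j, T_i).
    cross_modal = (sim_it - sim_it.t()).pow(2).mean()

    # In-modal consistency: image-image similarities should track
    # the corresponding text-text similarities.
    in_modal = (sim_ii - sim_tt).pow(2).mean()

    return cross_modal, in_modal
```

In the paper these terms are added, with small weights, to the standard contrastive objective rather than replacing it.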
“…Recent advancements in contrastive learning have enabled CLIP [46] to perform multimodal learning with 400M noisy image-text pairs crawled from the web. CLIP has been extended toward efficient model training and cycle consistency through various methods, such as ALBEF [24] and CyCLIP [15]. BLIP [23] includes text-to-image generation as an auxiliary task, which results in better performance by utilizing synthetic data as a bonus.…”
Section: Related Work
confidence: 99%
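
For context, the contrastive objective these works extend is CLIP's symmetric InfoNCE loss over a batch of matched image-text pairs. Below is a minimal sketch, assuming L2-normalized embeddings; the function name and the fixed temperature value are illustrative assumptions, not CLIP's exact training code (CLIP learns the temperature as a trainable logit scale).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Scaled cosine similarities; entry (i, i) is the matched pair.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```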
“…DeepAugment [23] was one of the first augmentation strategies to perform well on natural distribution shifts. Additionally, studies on CLIP-verse [39,29,33,18,35] have shown natural robustness. In our work, we take the best of both paradigms by leveraging the strengths of modern generative models to augment real datasets.…”
Section: Related Work
confidence: 99%