2022
DOI: 10.48550/arxiv.2204.03972
Preprint

FashionCLIP: Connecting Language and Images for Product Representations

Abstract: The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release…
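
As a concrete illustration of the zero-shot transfer the abstract describes, the sketch below runs CLIP-style zero-shot product classification through the Hugging Face transformers API. It loads the generic OpenAI CLIP checkpoint as a stand-in (not the FashionCLIP release, which would be loaded the same way), and the image path and candidate labels are hypothetical.

```python
# Minimal sketch of zero-shot product classification with a CLIP-like model.
# The checkpoint is generic OpenAI CLIP; a FashionCLIP checkpoint fine-tuned
# on product data would be swapped in via the same API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")  # hypothetical product photo
labels = ["a red dress", "a leather handbag", "a pair of running shoes"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned
# temperature; softmax turns them into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```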

Cited by 24 publications (6 citation statements)
References 17 publications
“…Recently, large models have received more and more attention due to their excellent generalization ability, allowing for improvements in various downstream tasks in multiple fields. FashionCLIP [6], a CLIP-like model for the fashion industry, was trained on 700K <image, text> pairs. Baldrati et al [7] proposed the Multimodal Garment Designer, the first diffusion model for fashion image editing conditioned on text, body pose and sketches.…”
Section: Fashion Attribute Analysis/Recognition (mentioning)
Confidence: 99%
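
The training setup quoted above, contrastive learning over <image, text> pairs, corresponds to CLIP's symmetric InfoNCE objective. A minimal sketch follows; this is not the authors' code, and the batch size, embedding width and temperature are illustrative.

```python
# Sketch of the symmetric contrastive (InfoNCE) objective behind CLIP-like
# training on <image, text> pairs. Embedding width, batch size and
# temperature are illustrative, not FashionCLIP's actual settings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = sim(image_i, text_j); matching pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Random embeddings for a batch of 8 pairs, just to show the call.
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```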
“…Differently, Mirchandani et al [54] introduced a novel fashion-specific pre-training framework based on weakly supervised triplets, while in [53], two different pre-training tasks were proposed, one based on multi-view contrastive learning and the other on pseudo-attribute classification. Another recent approach exploits the power of the CLIP model [41]; it is fine-tuned on more specific vision-and-language data for the fashion domain [55].…”
Section: Related Work (mentioning)
Confidence: 99%
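
Fine-tuning a generic pre-trained CLIP on domain-specific pairs, as the quote attributes to [55], can be sketched with transformers' built-in contrastive loss (return_loss=True). The data, captions and hyperparameters below are placeholder assumptions, not the cited paper's recipe.

```python
# Sketch of fine-tuning a pre-trained CLIP on fashion <image, text> pairs.
# Dataset, captions and hyperparameters are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(images, captions):
    """One contrastive update on a batch of PIL images and caption strings."""
    batch = processor(text=captions, images=images,
                      return_tensors="pt", padding=True)
    # return_loss=True makes CLIPModel compute the symmetric
    # in-batch contrastive loss itself.
    out = model(**batch, return_loss=True)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Hypothetical batch: product photos paired with catalog descriptions.
# (A realistic batch would hold many pairs; one pair is degenerate for
# a contrastive loss and is shown only to illustrate the call.)
loss = training_step([Image.open("dress.jpg")], ["long red evening dress"])
```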
“…Since our fashion data is also abundant, most early works pre-train on the fashion domain directly. However, a number of recent works [2,3,10,16,52] suggest that a generic-domain pre-trained CLIP [60] generalizes even better on the fashion tasks. In this work, we also exploit a pre-trained CLIP model.…”
Section: Related Work (mentioning)
Confidence: 99%
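
The retrieval capability the abstract mentions, built on a generic pre-trained CLIP as the quoted works suggest, reduces to ranking catalog image embeddings by cosine similarity to a text-query embedding. A sketch with hypothetical catalog paths and query:

```python
# Sketch of text -> image product retrieval with a pre-trained CLIP-like
# model. Catalog paths and the query string are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

catalog = ["img/dress_001.jpg", "img/bag_002.jpg", "img/shoe_003.jpg"]

with torch.no_grad():
    # Embed the catalog once; in production these vectors would live in
    # an approximate-nearest-neighbour index.
    pixels = processor(images=[Image.open(p) for p in catalog],
                       return_tensors="pt")
    img_emb = model.get_image_features(**pixels)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Embed the free-text query with the text tower.
    query = processor(text=["long red evening dress"],
                      return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**query)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Rank products by cosine similarity to the query.
scores = (txt_emb @ img_emb.t()).squeeze(0)
for idx in scores.argsort(descending=True).tolist():
    print(catalog[idx], f"{scores[idx]:.3f}")
```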