2022
DOI: 10.48550/arxiv.2204.03972
Preprint

FashionCLIP: Connecting Language and Images for Product Representations

Abstract: The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release…
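
As a concrete illustration of the zero-shot transfer the abstract describes, the sketch below runs CLIP-style zero-shot product classification through the Hugging Face transformers API. It loads the generic OpenAI CLIP checkpoint as a stand-in (not the FashionCLIP release, which would be loaded the same way), and the image path and candidate labels are hypothetical.

```python
# Minimal sketch of zero-shot product classification with a CLIP-like model.
# The checkpoint is generic OpenAI CLIP; a FashionCLIP checkpoint fine-tuned
# on product data would be swapped in via the same API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")  # hypothetical product photo
labels = ["a red dress", "a leather handbag", "a pair of running shoes"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned
# temperature; softmax turns them into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```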

Cited by 24 publications (6 citation statements)
References 17 publications
“…Recently, large models have received more and more attention due to their excellent generalization ability, allowing for improvements in various downstream tasks in multiple fields. FashionCLIP [6], a CLIP-like model for the fashion industry, was trained on 700K <image, text> pairs. Baldrati et al [7] proposed the Multimodal Garment Designer, the first diffusion model for fashion image editing conditioned on text, body pose and sketches.…”
Section: Fashion Attribute Analysis/Recognition (mentioning)
Confidence: 99%
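
The training setup quoted above, contrastive learning over <image, text> pairs, corresponds to CLIP's symmetric InfoNCE objective. A minimal sketch follows; this is not the authors' code, and the batch size, embedding width and temperature are illustrative.

```python
# Sketch of the symmetric contrastive (InfoNCE) objective behind CLIP-like
# training on <image, text> pairs. Embedding width, batch size and
# temperature are illustrative, not FashionCLIP's actual settings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = sim(image_i, text_j); matching pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Random embeddings for a batch of 8 pairs, just to show the call.
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```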
“…Differently, Mirchandani et al [54] introduced a novel fashion-specific pre-training framework based on weakly supervised triplets, while in [53], two different pre-training tasks were proposed, one based on multi-view contrastive learning and the other on pseudo-attribute classification. Another recent approach exploits the power of the CLIP model [41]; it is fine-tuned on more specific vision-and-language data for the fashion domain [55].…”
Section: Related Work (mentioning)
Confidence: 99%
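
Fine-tuning a generic pre-trained CLIP on domain-specific pairs, as the quote attributes to [55], can be sketched with transformers' built-in contrastive loss (return_loss=True). The data, captions and hyperparameters below are placeholder assumptions, not the cited paper's recipe.

```python
# Sketch of fine-tuning a pre-trained CLIP on fashion <image, text> pairs.
# Dataset, captions and hyperparameters are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(images, captions):
    """One contrastive update on a batch of PIL images and caption strings."""
    batch = processor(text=captions, images=images,
                      return_tensors="pt", padding=True)
    # return_loss=True makes CLIPModel compute the symmetric
    # in-batch contrastive loss itself.
    out = model(**batch, return_loss=True)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Hypothetical batch: product photos paired with catalog descriptions.
# (A realistic batch would hold many pairs; one pair is degenerate for
# a contrastive loss and is shown only to illustrate the call.)
loss = training_step([Image.open("dress.jpg")], ["long red evening dress"])
```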
“…Since our fashion data is also abundant, most early works pre-train on the fashion domain directly. However, a number of recent works [2,3,10,16,52] suggest that a generic-domain pre-trained CLIP [60] generalizes even better on the fashion tasks. In this work, we also exploit a pre-trained CLIP model.…”
Section: Related Work (mentioning)
Confidence: 99%
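
The retrieval capability the abstract mentions, built on a generic pre-trained CLIP as the quoted works suggest, reduces to ranking catalog image embeddings by cosine similarity to a text-query embedding. A sketch with hypothetical catalog paths and query:

```python
# Sketch of text -> image product retrieval with a pre-trained CLIP-like
# model. Catalog paths and the query string are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

catalog = ["img/dress_001.jpg", "img/bag_002.jpg", "img/shoe_003.jpg"]

with torch.no_grad():
    # Embed the catalog once; in production these vectors would live in
    # an approximate-nearest-neighbour index.
    pixels = processor(images=[Image.open(p) for p in catalog],
                       return_tensors="pt")
    img_emb = model.get_image_features(**pixels)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Embed the free-text query with the text tower.
    query = processor(text=["long red evening dress"],
                      return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**query)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Rank products by cosine similarity to the query.
scores = (txt_emb @ img_emb.t()).squeeze(0)
for idx in scores.argsort(descending=True).tolist():
    print(catalog[idx], f"{scores[idx]:.3f}")
```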