2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01752
EI-CLIP: Entity-aware Interventional Contrastive Learning for E-commerce Cross-modal Retrieval

Cited by 37 publications (16 citation statements) | References 37 publications
“…Radford et al [8] also provided a CNN version of the image encoder. However, due to the advantage of ViT on pre-training, the ViT encoder outperformed the CNN encoder, and it is most commonly applied in other works [9]. CLIP has recently been used in various tasks, such as e-commerce image retrieval [9], text-image generation [10], and image segmentation [18].…”
Section: The Network of CLIP (mentioning)
confidence: 99%
“…However, due to the advantage of ViT on pre-training, the ViT encoder outperformed the CNN encoder, and it is most commonly applied in other works [9]. CLIP has recently been used in various tasks, such as e-commerce image retrieval [9], text-image generation [10], and image segmentation [18]. However, CLIP still lacks the ability to effectively match local information in images to their descriptions in cross-modal information retrieval tasks [12].…”
Section: The Network of CLIP (mentioning)
confidence: 99%
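The statements above refer to CLIP's dual-encoder design, in which an image encoder and a text encoder are trained jointly with a symmetric contrastive objective. As background, a minimal sketch of that objective follows; the function name, batch size, and embedding width are illustrative assumptions, not code from EI-CLIP or the citing papers.

```python
# Minimal sketch of CLIP-style contrastive alignment between a batch of
# paired image and text embeddings (illustrative, not EI-CLIP's code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matching pairs sit on the diagonal."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits for every (image, text) combination.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both retrieval directions (image-to-text, text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with a hypothetical batch of 8 pairs and width-512 embeddings.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```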
“…The problem of semantic misunderstanding has also been investigated by previous works. EI-CLIP [33] considers the problem of cross-modal retrieval in the field of E-commerce. Sharing similar insight with our work, the authors notice the model bias towards some specific word tokens in CLIP, and introduce causal inference to align the text encoder with e-commerce domain knowledge.…”
Section: Related Work (mentioning)
confidence: 99%
“…In our case, we are facing two challenges in building the fashion-domain MTL model: (1) Architecturally, it is non-trivial to model the diverse tasks in one unified architecture. Taking the popular CLIP [60] as an example, its two-stream architecture is designed for image-text alignment [52] and thus lacks the modality fusion mechanism as required by many V+L fashion tasks (e.g., text-guided image retrieval [2,83] and image captioning [85]). (2) In terms of optimization, a fashion-domain MTL model is prone to the notorious negative transfer problem [8,13,36,46,56,63] due to both task input/output format differences and imbalanced dataset sizes.…”
Section: Introduction (mentioning)
confidence: 99%
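To make the architectural point in this last statement concrete, the sketch below contrasts a CLIP-style two-stream scorer, where the two modalities meet only at a final dot product, with a cross-attention fusion scorer of the kind that fusion-dependent V+L tasks require. All module names, token shapes, and the pooling choice are hypothetical illustrations, not the architecture of any cited paper.

```python
# Two-stream (CLIP-like) vs. fusion-based pair scoring (illustrative sketch).
import torch
import torch.nn as nn

class TwoStreamScorer(nn.Module):
    """CLIP-style: modalities are encoded independently and interact
    only through a final similarity matrix (no modality fusion)."""
    def forward(self, image_emb, text_emb):
        return image_emb @ text_emb.t()  # (B, B) pairwise scores

class FusionScorer(nn.Module):
    """Cross-attention fusion: text tokens attend to image patch tokens,
    enabling the local image-text matching the passage says CLIP lacks."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, image_tokens, text_tokens):
        # Text tokens query the image patch tokens, fusing the modalities.
        fused, _ = self.cross_attn(text_tokens, image_tokens, image_tokens)
        # Pool the fused sequence and score the pair jointly.
        return self.head(fused.mean(dim=1)).squeeze(-1)

# Hypothetical shapes: 4 images with 49 patch tokens, 4 captions with 16 tokens.
img_tokens = torch.randn(4, 49, 512)
txt_tokens = torch.randn(4, 16, 512)
print(FusionScorer()(img_tokens, txt_tokens).shape)  # torch.Size([4])
```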