2022
DOI: 10.48550/arxiv.2202.07247
Preprint

CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

Abstract: We introduce CommerceMM, a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a given piece of content (image, text, or image+text), and of generalizing to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc. We follow the pre-training + fine-tuning regime and present 5 effective pre-training tasks on image-text pairs. To embrace more comm…
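The retrieval tasks listed in the abstract all share one pattern: embed each modality into a common space and rank candidates by similarity. Below is a minimal two-tower sketch of that pattern only; it is not CommerceMM's actual architecture, and the module names, feature dimensions, and projection heads are illustrative assumptions.

```python
# Minimal image-text retrieval sketch: project both modalities into a
# shared embedding space and rank candidates by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerRetriever(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        # Projection heads sitting on top of off-the-shelf encoders
        # (e.g. pooled CNN features, [CLS] text embeddings).
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feats, txt_feats):
        # L2-normalize so dot products are cosine similarities.
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img @ txt.t()  # [num_images, num_texts] similarity matrix

model = TwoTowerRetriever()
img_feats = torch.randn(4, 2048)  # placeholder image features
txt_feats = torch.randn(4, 768)   # placeholder text features
sim = model(img_feats, txt_feats)
best_text_per_image = sim.argmax(dim=1)  # retrieval = ranking by similarity
```

The same scoring scheme covers image-to-product and query-to-product retrieval by swapping what each tower encodes; only the pre-training objectives differ.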

Cited by 3 publications (4 citation statements). References 32 publications.

Citation statements (ordered by relevance):
“…The basic architecture of vision-language pre-training (VLP) models is usually composed of a visual embedding module, a textual embedding module, and a fusion encoder [16,17]. In these architectures, the image is usually encoded with an off-the-shelf ResNet [5], Faster-RCNN [13], or Vision Transformer [3] model.…”
Section: Contrastive Loss Unit
Citation type: mentioning
confidence: 99%
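The quoted statement names a three-part layout: visual embedding, textual embedding, and a fusion encoder over both. A minimal sketch of that layout follows, assuming PyTorch; the choice of linear projections as embedders and a small TransformerEncoder as the fusion stage is an illustrative assumption, not any cited paper's configuration.

```python
# Sketch of the three-module VLP layout: visual embedding module,
# textual embedding module, and a fusion encoder over both.
import torch
import torch.nn as nn

class MinimalVLP(nn.Module):
    def __init__(self, img_dim=2048, vocab_size=30522, hidden=256):
        super().__init__()
        # Visual embedding: project off-the-shelf image features
        # (ResNet pooled features, Faster-RCNN regions, or ViT patches).
        self.visual_embed = nn.Linear(img_dim, hidden)
        # Textual embedding: token embeddings for the paired text.
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Fusion encoder: joint self-attention across both modalities.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img_regions, token_ids):
        v = self.visual_embed(img_regions)  # [B, num_regions, hidden]
        t = self.text_embed(token_ids)      # [B, seq_len, hidden]
        joint = torch.cat([v, t], dim=1)    # concatenate the two modalities
        return self.fusion(joint)           # fused multimodal states

model = MinimalVLP()
out = model(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # torch.Size([2, 52, 256]): 36 visual + 16 text positions
```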
“…Newer methods in this domain use graph neural networks (GNNs) (Liu et al., 2023a, 2023b) for feature aggregation by improving the embeddings, multi-modal representation learning with omni retrieval from text and images (Yu et al., 2022), and classification using Bayesian inference for GNNs (Liu et al., 2022). However, the goal of this study is to present a novel architecture that improves the performance of the BERT model on sentence representation learning for text classification.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…In these schemes, the basic instance is directly the input image. To better understand multimedia information such as advertisements or commerce topics, recent works [20,32,43] augment image-text pairs or even a flexible combination of text, image, and other valuable data such as queries and clicks. Compared with approaches that exploit the raw input (e.g., image, image-text pairs), the basic instance in DocReL is a semantic entity.…”
Section: Pre-training Tasks in VRDs
Citation type: mentioning
confidence: 99%
“…(a)) or visual-text pairs [32,43] (Fig. 2(b)), VRDs offer multi-modal triplets that consist of visual, text, and layout, as shown in Fig.…”
Citation type: mentioning
confidence: 99%
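The distinction drawn in this last statement is structural: an image-text pair carries two modalities, while a visually-rich document (VRD) instance carries a (visual, text, layout) triplet per semantic entity. A hypothetical container illustrating that difference; the field names are assumptions for illustration, not taken from any cited paper.

```python
# Data-structure sketch: image-text pair vs. VRD (visual, text, layout) triplet.
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class ImageTextPair:
    image: Any    # whole image, as in image-text pre-training
    caption: str  # paired text; no layout signal available

@dataclass
class VRDEntity:
    text: str                                 # transcribed entity text
    bbox: Tuple[float, float, float, float]   # layout: normalized x0, y0, x1, y1
    visual_crop: Any = None                   # image region for this entity

# A VRD sample is a list of entities, each exposing all three
# modalities (visual, text, layout) to the pre-training tasks.
vrd_sample: List[VRDEntity] = [
    VRDEntity(text="Invoice No.", bbox=(0.10, 0.05, 0.30, 0.08)),
    VRDEntity(text="12345", bbox=(0.32, 0.05, 0.45, 0.08)),
]
```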