Word embeddings (e.g., word2vec) have been successfully applied to eCommerce products through prod2vec. Inspired by the recent performance improvements that contextualized embeddings have brought to several NLP tasks, we propose to transfer BERT-like architectures to eCommerce: our model, Prod2BERT, is trained to generate product representations through masked session modeling. Through extensive experiments over multiple shops, different tasks, and a range of design choices, we systematically compare the accuracy of Prod2BERT and prod2vec embeddings: while Prod2BERT is found to be superior in several scenarios, we highlight the importance of resources and hyperparameters in the best performing models. Finally, we provide practitioners with guidelines for training embeddings under a variety of computational and data constraints.

* Federico and Bingqing contributed equally to this research.
† Corresponding author.

10 Costs are from official AWS pricing: 0.10 USD/h for the c4.large (https://aws.amazon.com/it/ec2/pricing/on-demand/) and 12.24 USD/h for the p3.8xlarge (https://aws.amazon.com/it/ec2/instance-types/p3/). While cost optimizations are obviously possible, the "naive" pricing is a good proxy to appreciate the difference between the two methods.
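Since the abstract only names the training objective, a minimal sketch may help clarify what masked session modeling looks like in practice: product IDs in a browsing session play the role of word tokens, a fraction of them is replaced by a mask token, and a BERT-like model is trained to recover the hidden products. All names, sizes, and special-token ids below are illustrative assumptions, not the paper's actual Prod2BERT configuration.

    # Minimal sketch of masked session modeling (illustrative only).
    import random
    import torch
    from transformers import BertConfig, BertForMaskedLM

    VOCAB_SIZE = 10_000   # hypothetical: product catalogue size + special tokens
    MASK_ID = 1           # hypothetical id reserved for the [MASK] token

    config = BertConfig(vocab_size=VOCAB_SIZE, hidden_size=128,
                        num_hidden_layers=4, num_attention_heads=4)
    model = BertForMaskedLM(config)

    def mask_session(session, mask_prob=0.15):
        """Mask random product tokens in a session (simplified BERT masking)."""
        input_ids = list(session)
        labels = [-100] * len(session)     # -100 positions are ignored by the loss
        positions = [i for i in range(len(session)) if random.random() < mask_prob]
        if not positions:                  # always mask at least one token
            positions = [random.randrange(len(session))]
        for i in positions:
            labels[i] = session[i]         # the model must recover this product
            input_ids[i] = MASK_ID
        return torch.tensor([input_ids]), torch.tensor([labels])

    # One toy browsing session of integer product ids (hypothetical data).
    inputs, labels = mask_session([42, 7, 1337, 256, 99])
    loss = model(input_ids=inputs, labels=labels).loss
    loss.backward()   # a single masked-session-modeling training step

For comparison, the prod2vec baseline applies a word2vec implementation (e.g., gensim's Word2Vec) to the same data, treating each session as a sentence and each product ID as a word.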
Ethical Considerations

User data has been collected by Coveo in the process of providing business services: data is collected and processed in an anonymized fashion, in compliance with existing legislation. In particular, the target dataset uses only anonymous UUIDs to label events and, as such, does not contain any information that can be linked to physical entities.