Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

Shukor, Mustafa; Couairon, Guillaume; Grechka, Asya; Cord, Matthieu

doi:10.1109/cvprw56347.2022.00503

Cited by 17 publications

(9 citation statements)

References 35 publications


“…Similar to the majority of previous studies [9,29,44], we sample 1 K and 10 K image-recipe pairs from the test partition and assess the retrieval performance for image-to-recipe and recipe-to-image tasks using median rank (MedR) and recall rate at top k (R@k). Among these metrics, MedR represents the median index of the retrieved samples for each query, measuring the ability of models to understand the semantic correlation between two modalities and the accuracy of retrieval.…”

Section: Evaluation Criteria (mentioning)

confidence: 99%

“…Crowdsourcing strategy is also used to construct program representations of recipes [41]. Thanks to the flourishing development of visual language pre-training recently, some pioneers [42][43][44][45] have further embedded complex semantic relationship information into common feature subspace by leveraging the pre-trained Contrastive Language-Image Pre-Training model (CLIP).…”

Section: Introduction (mentioning)

confidence: 99%


Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval

Zou, Zhu, Zhu et al. 2024

Foods


As a prominent topic in food computing, cross-modal recipe retrieval has garnered substantial attention. However, the semantic alignment across food images and recipes cannot be further enhanced due to the lack of intra-modal alignment in existing solutions. Additionally, a critical issue named food image ambiguity is overlooked, which disrupts the convergence of models. To these ends, we propose a novel Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval (MMACMR). To consider inter-modal and intra-modal alignment together, this method measures the ambiguous food image similarity under the guidance of their corresponding recipes. Additionally, we enhance recipe semantic representation learning by involving a cross-attention module between ingredients and instructions, which is effective in supporting food image similarity measurement. We conduct experiments on the challenging public dataset Recipe1M; as a result, our method outperforms several state-of-the-art methods in commonly used evaluation criteria.


“…For recipes, along with the proposed of various attention mechanism, Bert [44]-[46] and Transformer [47]-[50] are utilized to implement stronger textual encoder than sequential models such as skip-thought [26], [51] and LSTM [23]-[25] involved in early works. Furthermore, cross-modal attention [34], [52], [53] mechanism and large vision-language pre-training models [54]-[58] are employed in cross-modal recipe understanding, which further narrow heterogeneous gap via cross-modal interaction. The difficulty of recipe retrieval mainly stems from complex recipe sample including title, ingredients and instructions, other than a simple phrase or a sentence.…”

Section: A Cross-modal Recipe Retrieval (mentioning)

confidence: 99%

“…Following prior works [22], [34], [51], we evaluate the retrieval performance (both image-to-recipe task and recipe-to-image task) using median rank (MedR), which is the median index of the retrieved samples for each query, and recall rate at top-k, representing the percentage of queries for which the correct sample index belongs to the top-k retrieved samples.…”

Section: Metrics (mentioning)

confidence: 99%
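The two metrics described in the statement above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited papers: the function names `rank_of_match` and `retrieval_metrics` are hypothetical, the true match for query i is assumed to be candidate i, and ties are resolved optimistically (only strictly higher scores push the true match down).

```python
from statistics import median


def rank_of_match(similarities, true_index):
    """Return the 1-based rank of the true candidate for one query.

    Assumption: higher similarity means a better match, and ties are
    resolved in favour of the true candidate.
    """
    true_score = similarities[true_index]
    # Every candidate scoring strictly higher is retrieved before the true one.
    return 1 + sum(1 for s in similarities if s > true_score)


def retrieval_metrics(sim_matrix, ks=(1, 5, 10)):
    """Compute MedR and R@k for a square query-by-candidate similarity matrix.

    Assumption: sim_matrix[i][j] is the similarity of query i to candidate j,
    and candidate i is the ground-truth match for query i.
    """
    ranks = [rank_of_match(row, i) for i, row in enumerate(sim_matrix)]
    medr = median(ranks)
    # R@k: percentage of queries whose true match appears in the top k.
    recall = {k: 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return medr, recall


# Toy example with three image queries and three recipe candidates.
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.5],
    [0.1, 0.4, 0.7],
]
medr, recall = retrieval_metrics(sim, ks=(1, 2))
print(medr, recall)
```

In the protocol the statements describe, such a computation would be run on sampled 1K or 10K subsets of test pairs, once per retrieval direction (image-to-recipe and recipe-to-image, with the matrix transposed for the latter).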

“…Object detection [26] and image reconstruction [27]-[29] techniques are adopted by previous works to implement a strong-sighted model so that the key visual details can be focused. To better understand the complex semantics from recipes, some works [26], [30] devote to find the key terms in texts, while others [22], [31] attempt to explore the hidden consistent information between different components in recipes, or even capture the interaction of two modalities via cross-modal attention [32]-[34] to enhance cross-modal alignment [35]-[38].…”

Section: Introduction (mentioning)

confidence: 99%


CREAMY: Cross-Modal Recipe Retrieval By Avoiding Matching Imperfectly

Zou, Zhu, Zhu et al. 2024

IEEE Access


State-of-the-art methods for cross-modal recipe retrieval fail to consider an underlying but challenging issue: the imperfect-matching problem hidden in positive image-recipe pairs, which is a culprit causing over-fitting. To remedy this defect, two critical questions need to be answered: how to effectively recognize and filter out mismatching parts during model training, and how to pick out and preserve as much matching information as possible. To do so, this article proposes a novel method, Cross-modal Recipe rEtrieval by Avoiding Matching imperfectlY (CREAMY), which involves a newly designed learning strategy called Non-Matching and Partial-Matching (NMPM) that undertakes two tasks: (1) no longer forcibly aligning each positive image-recipe pair but rather capturing the complementary information from negative pairs; (2) delicately picking up and aligning the matchable part in each pair. To the best of our knowledge, this is the first attempt to address the imperfect-matching issue in the cross-modal recipe retrieval task. Empirical analysis conducted on the Recipe1M dataset validates the advantages of CREAMY over several state-of-the-art methods. The code is available at: https://github.com/users/pouqual/CREAMY.


SSM: Semantic Selection and Multi-view Alignment for Image-Text Retrieval

Yu, Yang et al. 2023

Lecture Notes in Computer Science
