2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw56347.2022.00503

Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

Cited by 17 publications (9 citation statements)
References 35 publications
“…Similar to the majority of previous studies [9,29,44], we sample 1K and 10K image-recipe pairs from the test partition and assess the retrieval performance for image-to-recipe and recipe-to-image tasks using median rank (MedR) and recall rate at top k (R@k). Among these metrics, MedR represents the median index of the retrieved samples for each query, measuring the ability of models to understand the semantic correlation between two modalities and the accuracy of retrieval.…”
Section: Evaluation Criteria
confidence: 99%
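The MedR and R@k protocol quoted above is simple to compute once both modalities are embedded. The following is a minimal sketch, not code from the cited works: it assumes an N x N similarity matrix between paired image and recipe embeddings, with the ground-truth match for each query on the diagonal, as in the standard Recipe1M evaluation setup.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """MedR and R@k from an N x N image-to-recipe similarity matrix.

    Assumes sim[i, j] scores image i against recipe j and that the
    ground-truth pair for each query sits on the diagonal.
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                      # candidates by descending similarity
    # 1-based rank of the true recipe for every image query
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    medr = float(np.median(ranks))
    recall = {k: float(np.mean(ranks <= k)) for k in ks}  # fraction of queries hit in the top-k
    return medr, recall

# Toy usage with random unit-norm embeddings and cosine similarity.
rng = np.random.default_rng(0)
img = rng.normal(size=(1000, 512)); img /= np.linalg.norm(img, axis=1, keepdims=True)
rec = rng.normal(size=(1000, 512)); rec /= np.linalg.norm(rec, axis=1, keepdims=True)
print(retrieval_metrics(img @ rec.T))
```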
“…A crowdsourcing strategy has also been used to construct program representations of recipes [41]. Thanks to the recent flourishing of vision-language pre-training, some pioneers [42][43][44][45] have further embedded complex semantic relationship information into a common feature subspace by leveraging the pre-trained Contrastive Language-Image Pre-training model (CLIP).…”
Section: Introduction
confidence: 99%
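As a rough illustration of the shared-subspace idea these CLIP-based approaches build on, the sketch below embeds a food photo and candidate text into CLIP's joint space using the Hugging Face implementation. The file name and captions are hypothetical placeholders; this is only a minimal sketch, not the method of any cited work.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dish.jpg")                            # hypothetical food photo
texts = ["a bowl of tomato soup", "a chocolate brownie"]  # placeholder recipe titles

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Both modalities land in the common feature subspace; cosine similarity ranks the candidates.
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)                                # similarity of the photo to each caption
```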
“…For recipes, alongside various proposed attention mechanisms, BERT [44]-[46] and Transformer [47]-[50] models are utilized to implement stronger textual encoders than the sequential models, such as skip-thought [26], [51] and LSTM [23]-[25], used in early works. Furthermore, cross-modal attention mechanisms [34], [52], [53] and large vision-language pre-training models [54]-[58] are employed in cross-modal recipe understanding, further narrowing the heterogeneous gap via cross-modal interaction. The difficulty of recipe retrieval mainly stems from the complexity of a recipe sample, which includes a title, ingredients, and instructions rather than a simple phrase or sentence.…”
Section: A. Cross-modal Recipe Retrieval
confidence: 99%
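To make the contrast with sequential encoders concrete, here is a minimal, hypothetical sketch of a BERT-based recipe text encoder: each component (title, ingredients, instructions) is encoded separately and the [CLS] vectors are averaged into one recipe embedding. It is an illustrative stand-in under those assumptions, not the architecture of any cited work.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_recipe(title, ingredients, instructions):
    # Encode title, ingredients, and instructions as three sequences,
    # then average their [CLS] vectors into a single recipe embedding.
    parts = [title, " ".join(ingredients), " ".join(instructions)]
    batch = tokenizer(parts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (3, seq_len, 768)
    return hidden[:, 0].mean(dim=0)                   # (768,) recipe vector

vec = encode_recipe(
    "Tomato soup",
    ["4 tomatoes", "1 onion", "salt"],
    ["Chop the vegetables.", "Simmer for 20 minutes.", "Blend and season."],
)
print(vec.shape)  # torch.Size([768])
```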
“…Following prior works [22], [34], [51], we evaluate the retrieval performance (both the image-to-recipe task and the recipe-to-image task) using median rank (MedR), which is the median index of the retrieved samples for each query, and recall rate at top-k, representing the percentage of queries for which the correct sample index belongs to the top-k retrieved samples.…”
Section: Metrics
confidence: 99%