Proceedings of the 2021 International Conference on Multimedia Retrieval
DOI: 10.1145/3460426.3463618

Cross-Modal Image-Recipe Retrieval via Intra- and Inter-Modality Hybrid Fusion

Abstract: In recent years, the Internet has stimulated an explosion of multimedia data. Food-related cooking videos, images, and recipes have promoted the rapid development of food computing. Image-recipe retrieval is an important sub-task in the field of cross-modal retrieval, which focuses on measuring the association between a food image and a recipe (title, ingredients, instructions). Although existing methods have proposed feasible solutions for image-recipe retrieval, there are still th…
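The task the abstract describes is usually posed as ranking recipes against a query image (and vice versa) by similarity in a shared embedding space. A minimal sketch of that generic setup follows; it is illustrative only, does not reproduce the paper's hybrid-fusion model, and the function and dimension names are assumptions.

# Minimal sketch of joint-embedding image-recipe retrieval
# (illustrative only; not the paper's actual IMHF architecture).
import torch
import torch.nn.functional as F

def retrieve(image_emb: torch.Tensor, recipe_embs: torch.Tensor, k: int = 5):
    """Rank recipes for one image by cosine similarity in the shared space.

    image_emb:   (d,)   embedding of the query food image
    recipe_embs: (N, d) embeddings of N candidate recipes
    """
    image_emb = F.normalize(image_emb, dim=0)
    recipe_embs = F.normalize(recipe_embs, dim=1)
    scores = recipe_embs @ image_emb          # (N,) cosine similarities
    return torch.topk(scores, k).indices      # indices of the top-k recipes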

Cited by 14 publications (11 citation statements) · References 43 publications
“…To this end, several well-known CV and NLP models are employed to generate high-quality embeddings from food images and recipe texts so as to achieve cross-modal alignment. For example, deep convolutional neural networks such as VGG [17], [39], [40] and ResNet [18], [19], [41], [42] are used in several works for visual information embedding. To further focus on essential visual features, Faster R-CNN is employed [43] to detect food objects.…”
Section: A. Cross-Modal Recipe Retrieval
mentioning
confidence: 99%
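The pattern this statement describes, a pretrained CNN backbone repurposed as a visual embedder, can be sketched as below. This is a generic illustration rather than any cited paper's exact configuration; the projection width and pretrained-weight choice are assumptions.

# Sketch of the common visual-encoding recipe: a ResNet backbone whose
# classification head is replaced by a projection into the joint space.
# (Illustrative; the cited works differ in backbone and projection details.)
import torch.nn as nn
from torchvision.models import resnet50

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V2")   # pretrained CNN features
        backbone.fc = nn.Identity()                    # drop the classifier head
        self.backbone = backbone
        self.project = nn.Linear(2048, embed_dim)      # map to the joint space

    def forward(self, images):                         # images: (B, 3, 224, 224)
        feats = self.backbone(images)                  # (B, 2048)
        return self.project(feats)                     # (B, embed_dim)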
“…For recipes, along with the proposal of various attention mechanisms, BERT [44]–[46] and Transformer [47]–[50] models are utilized to implement stronger textual encoders than the sequential models, such as skip-thought [26], [51] and LSTM [23]–[25], used in early works. Furthermore, cross-modal attention mechanisms [34], [52], [53] and large vision-language pre-training models [54]–[58] are employed for cross-modal recipe understanding, further narrowing the heterogeneity gap via cross-modal interaction. The difficulty of recipe retrieval mainly stems from the complexity of recipe samples, which consist of a title, ingredients, and instructions rather than a simple phrase or sentence.…”
Section: A. Cross-Modal Recipe Retrieval
mentioning
confidence: 99%
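A Transformer-based textual encoder of the kind mentioned here can be sketched with PyTorch's built-in encoder stack over the concatenated recipe fields. Vocabulary size, model width, and mean pooling are assumptions for illustration; positional encodings are omitted for brevity.

# Sketch of a Transformer recipe-text encoder replacing earlier
# LSTM/skip-thought encoders. (Illustrative; hyperparameters are assumed,
# not taken from the cited papers.)
import torch.nn as nn

class RecipeEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=512, heads=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids):  # (B, L): tokenized title + ingredients + instructions
        x = self.encoder(self.embed(token_ids))
        return x.mean(dim=1)       # mean-pool tokens -> (B, dim) recipe embedding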
“…et al. [33] propose a recipe encoder based on hierarchical Transformers. Besides, some works [5,22,23,52] resort to the Transformer or multi-head self-attention mechanism to learn direct information interaction between the recipe text and the food image, but in this way the learned embeddings have to be generated on the fly and cannot be indexed offline for retrieval. However, few of these existing methods focus on event features in modality-specific embedding learning and joint-embedding learning optimization to enhance cross-modal alignment and boost event-dense cross-modal retrieval performance.…”
Section: Instructions
mentioning
confidence: 99%
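The offline-indexing point in this statement is worth making concrete: a dual encoder scores pairs via a dot product of independently computed embeddings, so the recipe side can be embedded once and indexed, whereas a cross-attention scorer consumes the (image, recipe) pair jointly and must run at query time. A hypothetical pair scorer illustrating the latter:

# Cross-attention pair scorer: needs both inputs at query time, so its
# outputs cannot be precomputed and indexed offline the way dual-encoder
# embeddings can. (Hypothetical module, not any cited paper's exact design.)
import torch.nn as nn

class PairScorer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, img_tokens, txt_tokens):   # (B, Li, d), (B, Lt, d)
        # Image tokens attend over recipe tokens: pairwise interaction.
        fused, _ = self.attn(img_tokens, txt_tokens, txt_tokens)
        return self.head(fused.mean(dim=1))      # (B, 1) match score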
“…Baselines for Comparison. Fifteen baselines are considered: JESR [34], AMSR [4], AdaMine [2], R²GAN [53], ACME [41], MCEN [12], SN [52], CHEF [28], IMHF [22], SCAN [42], RDE-GAN [37], X-MRS [14], HF-ICMA [23], JEMA [48], and Pair+rec [33].…”
Section: Size of Test-Set Approaches
mentioning
confidence: 99%
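Comparisons among such baselines are conventionally reported as median rank (medR) and Recall@K over sampled test subsets of varying size. A sketch of that evaluation follows, assuming ground-truth pairs lie on the diagonal of the score matrix; this mirrors common practice in the field, not this paper's specific protocol.

# Standard retrieval metrics for image-recipe benchmarks: median rank and
# Recall@K. Assumes image i's correct recipe is recipe i (diagonal pairing).
import numpy as np

def medr_and_recall(scores: np.ndarray, ks=(1, 5, 10)):
    """scores: (N, N) image-to-recipe similarities; ground truth on diagonal."""
    order = np.argsort(-scores, axis=1)                  # best match first
    ranks = np.argmax(order == np.arange(len(scores))[:, None], axis=1) + 1
    medr = float(np.median(ranks))
    recalls = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    return medr, recalls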