Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval

Chen, Jingjing; Ngo, Chong‐Wah; Feng, Fuli; Chua, Tat-Seng

doi:10.1145/3240508.3240627

Cited by 100 publications

(144 citation statements)

References 36 publications

Supporting

Mentioning: 142

Contrasting


“…The resource types include image , recipe , food title , title and ingredients , and recipe-image pair ( , ). The complete training data refers to the set of recipe-image pairs for fully supervised model training [10,33,36]. In the remaining sections, we abbreviate the source and target domains with the superscripts and respectively.…”

Section: Cross-domain Food Transfer (mentioning)

confidence: 99%


Cross-domain Cross-modal Food Transfer

Zhu

Ngo

Chen

2020

Proceedings of the 28th ACM International Conference on Multimedia

Self Cite


The recent works in cross-modal image-to-recipe retrieval pave a new way to scale up food recognition. By learning the joint space between food images and recipes, food recognition is boiled down as a retrieval problem by evaluating the similarity of embedded features. The major drawback, nevertheless, is the difficulty in applying an already-trained model to recognize different cuisines of dishes unknown to the model. In general, model updating with new training examples, in the form of image-recipe pairs, is required to adapt a model to new cooking styles in a cuisine. Nevertheless, in practice, acquiring sufficient number of image-recipe pairs for model transfer can be time-consuming. This paper addresses the challenge of resource scarcity in the scenario that only partial data instead of a complete view of data is accessible for model transfer. Partial data refers to missing information such as absence of image modality or cooking instructions from an image-recipe pair. To cope with partial data, a novel generic model, equipped with various loss functions including cross-modal metric learning, recipe residual loss, semantic regularization and adversarial learning, is proposed for cross-domain transfer learning. Experiments are conducted on three different cuisines (Chuan, Yue and Washoku) to provide insights on scaling up food recognition across domains with limited training resources.

CCS CONCEPTS • Information systems → Multimedia and multimodal retrieval.
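As a rough illustration of the retrieval idea in this abstract (ranking recipes by the similarity of embedded features in a joint space), the sketch below may help. It is not the authors' model: the embedding vectors, the recipe names, and the helper functions `cosine` and `rank_recipes` are all hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_recipes(image_vec, recipe_vecs):
    """Return recipe names sorted from most to least similar to the image."""
    scored = [(cosine(image_vec, vec), name) for name, vec in recipe_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored]


# Hypothetical embeddings, assumed to be already mapped into the joint space
# by some trained image encoder and recipe encoder (not shown here).
recipe_vecs = {
    "mapo tofu": [0.9, 0.1, 0.0],
    "steamed fish": [0.1, 0.9, 0.2],
    "miso soup": [0.0, 0.3, 0.9],
}
query_image = [0.8, 0.2, 0.1]

ranking = rank_recipes(query_image, recipe_vecs)
print(ranking)
```

In this toy setting, recognizing the dish in the query image amounts to taking the top-ranked recipe.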


“…Based upon these prior works [4,7,9,29,33,36], this paper extends from cross-modal to cross-domain food retrieval. Leveraging on image-recipe pairs in a source domain, we consider the problem of food transfer as recognizing food in a target domain with new food categories and attributes.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-domain Cross-modal Food Transfer

Zhu

Ngo

Chen

2020

Proceedings of the 28th ACM International Conference on Multimedia

Self Cite


“…Recipe1M [27] is the only large-scale food dataset with English recipes and images publicly available. Many related works [6,26,27,32] are based on this dataset. The raw dataset contains more than 1 million recipes and almost 900k images.…”

Section: Experiments, 4.1 Datasets (mentioning)

confidence: 99%

“…People tend to spend much time on recipes because cooking is closely related to our life. Lots of words have been done to deconstruct and understand food, including food classification [8,16], recipe-image embedding [6,27,32] and image-to-recipe generation [26]. Furthermore, dish appearance visualization in advance will be of great help for designing new recipes, which provides evident significance to image generation from given recipes.…”

Section: Introduction (mentioning)

confidence: 99%

ChefGAN

Pan

Dai

Hou

et al.

2020

Proceedings of the 28th ACM International Conference on Multimedia


Although significant progress has been made in generating images from the text by using generative adversarial networks (GANs), it is still challenging to deal with long text, which contains complex semantic information like recipes. This paper focuses on generating images with high visual realism and semantic consistency from the complex text of recipes. To achieve this, we propose a GANs based method termed ChefGAN. The critical concept of ChefGAN is that a joint image-recipe embedding model is used before the generation task to provide high-quality representations of recipes, and it acts as an extra regularization during the generation to improve semantic consistency. Two modules are designed for this: an image text embedding module (ITEM) and a cascaded image generation module (CIGM). The generation process is carried out in 3 steps: (1) Two encoders in ITEM are trained simultaneously to generate similar representations for each image-recipe pair. (2) CIGM generates images according to the representations from ITEM's text encoder. (3) The generated image is fed into ITEM's image encoder to calculate the similarity with the given recipe. This process can provide additional regularization effect other than the impact of a discriminator. To facilitate convergence, we applied a two-stage training strategy, which generates an image with low resolution and then one with high resolution in the CIGM module. Compared with other representative state-of-the-art methods, ChefGAN demonstrates better performance both in visual realism and semantic consistency.

CCS CONCEPTS • Information systems → Multimedia content creation; • Computing methodologies → Computer vision representations.


“…Deriving a joint representation from different modalities associated with a multimedia item has been a long-standing research question in cross-media retrieval [6,19,30,34,37]. The main idea behind such approaches is learning a common space to which different modalities, usually text and visual, can be mapped and directly compared.…”

Section: Social Multimedia Representation (mentioning)

confidence: 99%

Interactive Search and Exploration in Discussion Forums Using Multimodal Embeddings

Gornishka

Rudinac

Worring

2019

MultiMedia Modeling


In this paper we present a novel interactive multimodal learning system, which facilitates search and exploration in large networks of social multimedia users. It allows the analyst to identify and select users of interest, and to find similar users in an interactive learning setting. Our approach is based on novel multimodal representations of users, words and concepts, which we simultaneously learn by deploying a general-purpose neural embedding model. We show these representations to be useful not only for categorizing users, but also for automatically generating user and community profiles. Inspired by traditional summarization approaches, we create the profiles by selecting diverse and representative content from all available modalities, i.e. the text, image and user modality. The usefulness of the approach is evaluated using artificial actors, which simulate user behavior in a relevance feedback scenario. Multiple experiments were conducted in order to evaluate the quality of our multimodal representations, to compare different embedding strategies, and to determine the importance of different modalities. We demonstrate the capabilities of the proposed approach on two different multimedia collections originating from the violent online extremism forum Stormfront and the microblogging platform Twitter, which are particularly interesting due to the high semantic level of the discussions they feature.

CCS CONCEPTS • Information systems → Multimedia and multimodal retrieval.

KEYWORDS multimedia analytics, search, exploration, interactive learning, multimodal embeddings, online discussion forums, social multimedia

• First, compact but meaningful multimodal content representations are needed to ensure the interactivity of the system
