Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition

Min, Weiqing; Liu, Linhu; Luo, Z.P.; Jiang, Shuqiang

doi:10.1145/3343031.3350948

Cited by 99 publications

(51 citation statements)

References 27 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Request permissions from permissions@acm.org. MM '20, October 12-16, 2020 their attributes such as ingredients [2,6,23], cooking and cutting methods [8]. In practice, the number of categories can easily go beyond a thousand for a city-scale food dataset [24].…”

Section: Introductionmentioning

confidence: 99%

Cross-domain Cross-modal Food Transfer

Zhu

Ngo

Chen

2020

Proceedings of the 28th ACM International Conference on Multimedia

View full text Add to dashboard Cite

The recent works in cross-modal image-to-recipe retrieval pave a new way to scale up food recognition. By learning the joint space between food images and recipes, food recognition is boiled down as a retrieval problem by evaluating the similarity of embedded features. The major drawback, nevertheless, is the difficulty in applying an already-trained model to recognize different cuisines of dishes unknown to the model. In general, model updating with new training examples, in the form of image-recipe pairs, is required to adapt a model to new cooking styles in a cuisine. Nevertheless, in practice, acquiring sufficient number of image-recipe pairs for model transfer can be time-consuming. This paper addresses the challenge of resource scarcity in the scenario that only partial data instead of a complete view of data is accessible for model transfer. Partial data refers to missing information such as absence of image modality or cooking instructions from an image-recipe pair. To cope with partial data, a novel generic model, equipped with various loss functions including cross-modal metric learning, recipe residual loss, semantic regularization and adversarial learning, is proposed for cross-domain transfer learning. Experiments are conducted on three different cuisines (Chuan, Yue and Washoku) to provide insights on scaling up food recognition across domains with limited training resources. CCS CONCEPTS • Information systems → Multimedia and multimodal retrieval.

show abstract

Section: Introductionmentioning

confidence: 99%

Cross-domain Cross-modal Food Transfer

Zhu

Ngo

Chen

2020

Proceedings of the 28th ACM International Conference on Multimedia

View full text Add to dashboard Cite

show abstract

“…An appropriate sampling rate still needs to be investigated, if the dataset expands and includes more subjects with different eating speeds. Second, given the fact that all deep network models were only trained with weak supervision for food recognition (i.e., no bounding boxes or masks provided), although the results so far are reasonable, we conjecture that better results could be achieved by 1) labelling consumed food or all visible food items with bounding boxes or masks, or 2) using categorical labels or visual attention techniques to localize food items [33], [52]. Third, this work does not investigate bite size estimation.…”

Section: Discussionmentioning

confidence: 99%

“…Recently, efforts to assess individual dietary intake in communal eating scenarios have also been made with the use of a 360 camera [50], [51]. Fine-grained food ingredient recognition has also been studied to enhance general food recognition [52], or to perform recipe retrieval [53], but so far studies have only been carried out in recognizing ingredients from food images rather than from dietary intake videos.…”

Section: A Technological Approaches For Dietary Assessmentmentioning

confidence: 99%

Counting Bites and Recognizing Consumed Food from Videos for Passive Dietary Monitoring

Qiu

Jiang

et al. 2021

IEEE J. Biomed. Health Inform.

View full text Add to dashboard Cite

Assessing dietary intake in epidemiological studies are predominantly based on self-reports, which are subjective, inefficient, and also prone to error. Technological approaches are therefore emerging to provide objective dietary assessments. Using only egocentric dietary intake videos, this work aims to provide accurate estimation on individual dietary intake through recognizing consumed food items and counting the number of bites taken. This is different from previous studies that rely on inertial sensing to count bites, and also previous studies that only recognize visible food items but not consumed ones. As a subject may not consume all food items visible in a meal, recognizing those consumed food items is more valuable. A new dataset that has 1,022 dietary intake video clips was constructed to validate our concept of bite counting and consumed food item recognition from egocentric videos. 12 subjects participated and 52 meals were captured. A total of 66 unique food items, including food ingredients and drinks, were labelled in the dataset along with a total of 2,039 labelled bites. Deep neural networks were used to perform bite counting and food item recognition in an end-to-end manner. Experiments have shown that counting bites directly from video clips can reach 74.15% top-1 accuracy (classifying between 0-4 bites in 20-second clips), and a MSE value of 0.312 (when using regression). Our experiments on video-based food recognition also show that recognizing consumed food items is indeed harder than recognizing visible ones, with a drop of 25% in F1 score.

show abstract

“…Newly proposed system, the visual attention analysis, has shown that the network is able to the relevant portions of the image that should . Going further, the Attention Network art recognition 200 [7]. The recognition systems were done by learning process was based semantic network was constructed.…”

Section: Materials and Methodsology 31 Datasetmentioning

confidence: 99%

“…tion is gaining more attention in the multimedia community due to its various applications, e.g., log and personalized healthcare [7]. It is common that one dish can be served in several ways.…”

Section: Introductionmentioning

confidence: 99%