Multi-modal Cooking Workflow Construction for Food Recipes

Pan, Liangming; Chen, Jingjing; Wu, Jian; Liu, Shaoteng; Ngo, Chong‐Wah; Kan, Min-Yen; Jiang, Yu‐Gang; Chua, Tat-Seng

doi:10.1145/3394171.3413765

Cited by 15 publications

(1 citation statement)

References 33 publications

“…Although the above work has focused on parsing text-only recipes, little work has addressed cross-modal analysis because few datasets are available. Pan et al. [11] recently created a novel cross-modal dataset, the MM-ReS dataset. This dataset consists of recipes, image sequences, and annotated tree structures, allowing us to analyze the cause-and-effect relations between step texts and images in the recipe and image sequence.…”

Section: B Structure Estimation For Context Dependency (mentioning)

confidence: 99%

Structure-Aware Procedural Text Generation From an Image Sequence

et al. 2021

It is an important activity for our society to create new value by combining materials. From daily cooking to industrial manufacturing, we often describe how to do so as a procedural text. As pointed out by previous studies on natural language understanding, one important property of procedural text is its context dependency, which arises from the merging operations of materials and can be represented by a graph or tree structure. This paper investigates the impact of explicitly introducing such a structure on the vision-and-language task of procedural text generation from an image sequence. To this end, we propose (1) a new dataset, which extends the definition of a tree structure, the merging tree, to a vision-and-language version, and (2) a novel structure-aware procedural text generation model, which learns the context dependency efficiently. Experimental results show that the proposed method can boost the performance of traditional versatile methods.

INDEX TERMS Natural language processing, text generation, procedural text, vision and language.

B. EXTENSION TO SIMMR: LARGER DATASET WITH VISUAL REFERENCE

As shown in Figure 1(b), we extend SIMMR to vSIMMR by annotating merging trees with image sequences. This
