Structure-Aware Procedural Text Generation From an Image Sequence

Nishimura, Taichi; Hashimoto, Akihiro; Ushiku, Yoshitaka; Kameko, Hirotaka; Yamakata, Yoko; Mori, Shinsuke

doi:10.1109/access.2020.3043452

Cited by 18 publications

(4 citation statements)

References 27 publications

“…Video captioning algorithms were also used to extract recipes from an uncut video (Nishimura, Hashimoto, et al, 2022). This sort of system can also be made using off-the-shelf vision models and implemented into a robotic chef system (Sochacki, Abdulali, Hosseini, et al, 2023).…”

Section: Recipes Understanding and Learning (mentioning)

confidence: 99%

“…Robotic Chef was shown to translate a recipe into actions using a hard‐coded text analysis logic and follow them (Bollini et al, 2013). Video captioning algorithms were also used to extract recipes from an uncut video (Nishimura, Hashimoto, et al, 2022). This sort of system can also be made using off‐the‐shelf vision models and implemented into a robotic chef system (Sochacki, Abdulali, Hosseini, et al, 2023).…”

Section: Emerging Technologies (mentioning)

confidence: 99%

Towards practical robotic chef: Review of relevant work and future challenges

Sochacki, Zhang, Abdulali et al. 2024

Journal of Field Robotics

Robotic chefs are a promising technology that can improve the availability of quality food by reducing the time required for cooking, therefore decreasing food's overall cost. This paper clarifies and structures design and benchmarking rules in this new area of research, and provides a comprehensive review of technologies suitable for the construction of cooking robots. The diner is an ultimate judge of the cooking outcome, therefore we put focus on explaining human food preferences and perception of taste and ways to use them for control. Mechanical design of robotic chefs at a practically low cost remains the challenge, but some recently published gripper designs as well as whole robotic systems show the use of cheap materials or off‐the‐shelf components. Moreover, technologies like taste sensing, machine learning, and computer vision are making their way into robotic cooking enabling smart sensing and therefore improving controllability and autonomy. Furthermore, objective assessment of taste and food palatability is a challenge even for trained humans, therefore the paper provides a list of procedures for benchmarking the robot's tasting and cooking abilities. The paper is written from the point of view of a researcher or engineer building a practical robotic system, therefore there is a strong priority for solutions and technologies that are proven, robust and self‐contained enough to be a part of a larger system.

“…This task is more complex than regular image captioning [15,35] due to the difficulty in decoding long recipe texts. Nishimura et al [22,23] approached this problem by generating instructions from a sequence of images. Wang et al [36] first estimate the intermediate tree-structured representation of cooking instructions from an image, and generate full sentences from it.…”

Section: Cross-modal Synthesis (mentioning)

confidence: 99%

Cross-modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Subspace Learning

Pham, Pavlović 2021

Proceedings of the 29th ACM International Conference on Multimedia

Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed method significantly outperforms the current state-of-the-arts (SOTA) on retrieval tasks. Furthermore, the representational power of the learned representation is demonstrated through a generative food image synthesis model conditioned on recipe embeddings. Synthesized images can effectively reproduce the visual appearance of paired samples, indicating that the learned representation captures the joint semantics of both the textual recipe and its visual content, thus narrowing the modality gap.

CCS Concepts: Information systems → Multimedia and multimodal retrieval; Computing methodologies → Learning latent representations.

“…By verbalizing the task contents, a task procedure can be materialized and the reproducibility of the same work by humans and robots can be improved. Previously, Nishimura et al [11] generated instructions by verbalizing a series of task information from a series of cooking images. Erdal et al [2] proposed a framework that could automatically describe task motions from demonstration videos of human tasks.…”

Section: Related Work (mentioning)

confidence: 99%

Assembly Planning by Recognizing a Graphical Instruction Manual

Sera, Yamanobe, Ramirez-Alpizar et al. 2021

Preprint

This paper proposes a robot assembly planning method by automatically reading the graphical instruction manuals designed for humans. Essentially, the method generates an Assembly Task Sequence Graph (ATSG) by recognizing a graphical instruction manual. An ATSG is a graph describing the assembly task procedure by detecting types of parts included in the instruction images, completing the missing information automatically, and correcting the detection errors automatically. To build an ATSG, the proposed method first extracts the information of the parts contained in each image of the graphical instruction manual. Then, by using the extracted part information, it estimates the proper work motions and tools for the assembly task. After that, the method builds an ATSG by considering the relationship between the previous and following images, which makes it possible to estimate the undetected parts caused by occlusion using the information of the entire image series. Finally, by collating the total number of each part with the generated ATSG, the excess or deficiency of parts is investigated, and task procedures are removed or added according to those parts. In the experiment section, we build an ATSG by applying the proposed method to a graphical instruction manual for a chair and demonstrate that the action sequences found in the ATSG can be performed by a dual-arm robot. The results show the proposed method is effective and simplifies robot teaching in automatic assembly.
