2020
DOI: 10.48550/arxiv.2003.10066
Preprint

Caption Generation of Robot Behaviors based on Unsupervised Learning of Action Segments

Abstract: Bridging robot action sequences and their natural language captions is an important task for increasing the explainability of human-assisting robots in this rapidly evolving field. In this paper, we propose a system for generating natural language captions that describe the behaviors of human-assisting robots. The system describes robot actions using robot observations (histories from actuator systems and cameras), aiming at end-to-end bridging between robot actions and natural language captions. Two reasons make it cha…

Cited by 1 publication (3 citation statements)
References 24 publications
“…One possible solution is transferring the data collected in a simulated world to the real world (sim2real); however, transferring the knowledge acquired in simulated worlds remains a challenge [28]. Effective feature extraction must be investigated so that systems can work in actual situations [29].…”
Section: Data Scalability (mentioning, confidence: 99%)
“…In this study, we face a difficulty to collect a large amount of data because we use images that assume a robot's first-person viewpoints in certain environments. Abstracting the dataset is critical for effectively using such a small amount of data as training data [29]. Feature extraction methods, with pre-trained models trained on large-scale data, have been widely used in recent years; however, more focused information is required to understand visual situations.…”
Section: Annotating Multimodal Features (mentioning, confidence: 99%)