2020
DOI: 10.48550/arxiv.2003.10066
Preprint

Caption Generation of Robot Behaviors based on Unsupervised Learning of Action Segments

Abstract: Bridging robot action sequences and their natural language captions is an important task for increasing the explainability of human-assisting robots in this rapidly evolving field. In this paper, we propose a system for generating natural language captions that describe the behaviors of human-assisting robots. The system describes robot actions using robot observations (histories from actuator systems and cameras), aiming at end-to-end bridging between robot actions and natural language captions. Two reasons make it cha…

Cited by 1 publication (3 citation statements)
References 24 publications
“…One possible solution is transferring the data collected in a simulated world to the real world (sim2real); however, transferring the knowledge acquired in simulated worlds remains a challenge [28]. Effective feature extraction must be investigated so that systems can work in actual situations [29].…”
Section: Data Scalability (mentioning, confidence: 99%)
“…In this study, we face a difficulty to collect a large amount of data because we use images that assume a robot's first-person viewpoints in certain environments. Abstracting the dataset is critical for effectively using such a small amount of data as training data [29]. Feature extraction methods, with pre-trained models trained on large-scale data, have been widely used in recent years; however, more focused information is required to understand visual situations.…”
Section: Annotating Multimodal Features (mentioning, confidence: 99%)