Symbol Emergence in Robotics: A Survey

Taniguchi, Tadahiro; Nagai, Takayuki; Nagaoka, Tomoaki; Iwahashi, Naoto; Ogata, Tetsuya; Asoh, Hideki

doi:10.48550/arxiv.1509.08973

Cited by 2 publications

(2 citation statements)

References 141 publications

(202 reference statements)

“…Such methods can, for instance, make it possible to search through a corpus of unlabelled speech using voice queries [1], allow topics within speech utterances to be identified without supervision [2], or can be used to automatically cluster related spoken documents [3]. Similar techniques are required to model how human infants acquire language from speech input [4], and for developing robotic applications that can learn a new language in an unknown environment [5,6].…”

Section: Introduction (mentioning)

confidence: 99%

A segmental framework for fully-unsupervised large-vocabulary speech recognition

Kamper

Jansen

Goldwater

2017

Computer Speech & Language

128

Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early term discovery systems focused on identifying isolated recurring patterns in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units, effectively performing unsupervised speech recognition. This article presents the first attempt we are aware of to apply such a system to large-vocabulary multi-speaker data. Our system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a variety of measures including word error rate (obtained by mapping the unsupervised output to ground truth transcriptions). Very high word error rates are reported (in the order of 70-80% for speaker-dependent and 80-95% for speaker-independent systems), highlighting the difficulty of this task. Nevertheless, in terms of cluster quality and word segmentation metrics, we show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding). Our system's discovered clusters are still less pure than those of unsupervised term discovery systems, but provide far greater coverage.
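The abstract above describes mapping a variable-length sequence of feature frames to a single fixed-dimensional vector, but it does not say how the mapping is done. The sketch below illustrates one simple possibility, uniform downsampling followed by flattening. The function name `downsample_embedding` and the parameter `n_samples` are hypothetical, and this is not claimed to be the exact mapping used in the paper.

```python
from typing import List, Sequence


def downsample_embedding(
    frames: Sequence[Sequence[float]], n_samples: int = 10
) -> List[float]:
    """Map a segment of T feature frames (each of dimension D) to one
    vector of length n_samples * D.

    Assumption for illustration: frames are picked at (roughly) evenly
    spaced positions and concatenated.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    if n_samples < 1:
        raise ValueError("n_samples must be positive")

    last = len(frames) - 1
    if n_samples == 1:
        positions = [last // 2]  # a single sample: take the middle frame
    else:
        # Evenly spaced positions from the first to the last frame.
        positions = [round(i * last / (n_samples - 1)) for i in range(n_samples)]

    embedding: List[float] = []
    for p in positions:
        embedding.extend(frames[p])
    return embedding


# Two segments of different duration end up with the same dimensionality.
short_segment = [[0.1 * t] * 13 for t in range(7)]   # 7 frames, 13-dim
long_segment = [[0.1 * t] * 13 for t in range(53)]   # 53 frames, 13-dim
short_vec = downsample_embedding(short_segment)
long_vec = downsample_embedding(long_segment)
```

Because every segment maps to a vector of the same length regardless of its duration, candidate word segments can be compared and clustered directly in the embedding space.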

“…The human capability for object categorization is a fundamental topic in cognitive science Barsalou (1999). In the field of robotics, adaptive formation of object categories that considers a robot's embodiment, i.e., its sensory-motor system, is gathering attention as a way to solve the symbol grounding problem Harnad (1990); Taniguchi et al (2015).…”

Section: Multimodal Categorization (mentioning)

confidence: 99%

Multimodal Hierarchical Dirichlet Process-based Active Perception

Taniguchi

Takano

Yoshino

2015

Preprint

Self Cite

In this paper, we propose an active perception method for recognizing object categories based on the multimodal hierarchical Dirichlet process (MHDP). The MHDP enables a robot to form object categories using multimodal information, e.g., visual, auditory, and haptic information, which can be observed by performing actions on an object. However, performing many actions on a target object requires a long time. In a real-time scenario, i.e., when the time is limited, the robot has to determine the set of actions that is most effective for recognizing a target object. We propose an MHDP-based active perception method that uses the information gain (IG) maximization criterion and lazy greedy algorithm. We show that the IG maximization criterion is optimal in the sense that the criterion is equivalent to a minimization of the expected Kullback-Leibler divergence between a final recognition state and the recognition state after the next set of actions. However, a straightforward calculation of IG is practically impossible. Therefore, we derive an efficient Monte Carlo approximation method for IG by making use of a property of the MHDP. We also show that the IG has submodular and non-decreasing properties as a set function because of the structure of the graphical model of the MHDP. Therefore, the IG maximization problem is reduced to a submodular maximization problem. This means that greedy and lazy greedy algorithms are effective and have a theoretical justification for their performance. We conducted an experiment using an upper-torso humanoid robot and a second one using synthetic data. The experimental results show that the method enables the robot to select a set of actions that allow it to recognize target objects quickly and accurately. The results support our theoretical outcomes.
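The abstract above reduces action selection to maximizing a non-decreasing submodular set function, which is what makes greedy and lazy greedy selection well founded: for such functions under a cardinality budget, greedy selection is known to achieve at least a (1 - 1/e) fraction of the optimal value. The sketch below shows the lazy greedy idea on a toy problem. The names `lazy_greedy`, `coverage`, and `ACTIONS` are hypothetical, and a simple coverage count stands in for information gain; it is not the MHDP-based Monte Carlo estimate of IG described in the paper.

```python
import heapq
from typing import Callable, FrozenSet, List, Sequence


def lazy_greedy(
    candidates: Sequence[str],
    value_fn: Callable[[FrozenSet[str]], float],
    budget: int,
) -> List[str]:
    """Pick up to `budget` items, greedily maximizing a monotone
    submodular `value_fn`, re-evaluating marginal gains only when needed."""
    selected: List[str] = []
    current = value_fn(frozenset())

    # Heap entries: (-gain, tie_breaker, item, round_in_which_gain_was_computed).
    heap = [
        (-(value_fn(frozenset([c])) - current), i, c, 0)
        for i, c in enumerate(candidates)
    ]
    heapq.heapify(heap)

    while heap and len(selected) < budget:
        neg_gain, i, item, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # The gain is up to date. By submodularity every other stored
            # gain is an upper bound, so this item is the greedy choice.
            selected.append(item)
            current -= neg_gain
        else:
            # Stale gain: recompute against the current selection and re-queue.
            fresh = value_fn(frozenset(selected) | {item}) - current
            heapq.heappush(heap, (-fresh, i, item, len(selected)))
    return selected


# Toy stand-in for information gain: each action reveals a set of features,
# and the value of a set of actions is the number of distinct features seen.
ACTIONS = {
    "look": {1, 2, 3},
    "grasp": {3, 4},
    "shake": {4, 5, 6, 7},
    "hit": {1, 2},
}


def coverage(chosen: FrozenSet[str]) -> float:
    covered = set()
    for name in chosen:
        covered |= ACTIONS[name]
    return len(covered)


best_two = lazy_greedy(list(ACTIONS), coverage, budget=2)
```

In this toy run, `best_two` is `["shake", "look"]`. After "shake" is chosen, only "look" has its gain recomputed before being selected; the stale gains of the other actions already bound them below it, which is the saving that lazy evaluation provides when each gain evaluation is expensive.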
