We summarize the accomplishments of a multidisciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Focusing on the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.
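As a concrete illustration of the word-segmentation component, the following is a minimal sketch of a Goldwater-style unigram Dirichlet-process segmenter that Gibbs-samples word boundaries over a string of discovered subword units. The hyperparameters (ALPHA, P_STOP), the base distribution, and the toy input are assumptions for illustration, not the workshop's actual configuration.

```python
import random
from collections import Counter

ALPHA = 1.0   # Dirichlet process concentration (assumed value)
P_STOP = 0.5  # geometric word-length parameter of the base distribution (assumed)

def p0(word, n_units):
    # Base distribution: geometric length prior times uniform unit choices.
    return P_STOP * (1 - P_STOP) ** (len(word) - 1) * (1 / n_units) ** len(word)

def p_word(word, counts, total, n_units):
    # Dirichlet-process predictive probability of `word` given lexicon counts.
    return (counts[word] + ALPHA * p0(word, n_units)) / (total + ALPHA)

def seg_words(units, bounds):
    # Split the unit string into words at the given internal boundaries.
    cuts = [0] + sorted(bounds) + [len(units)]
    return [units[a:b] for a, b in zip(cuts, cuts[1:])]

def gibbs_segment(units, n_iters=200, seed=0):
    rng = random.Random(seed)
    n_units = len(set(units))
    bounds = set()  # start with no internal word boundaries
    for _ in range(n_iters):
        for i in range(1, len(units)):  # resample each candidate boundary
            left = max((b for b in bounds if b < i), default=0)
            right = min((b for b in bounds if b > i), default=len(units))
            w1, w2, w12 = units[left:i], units[i:right], units[left:right]
            counts = Counter(seg_words(units, bounds))
            if i in bounds:             # exclude the words being resampled
                counts[w1] -= 1
                counts[w2] -= 1
            else:
                counts[w12] -= 1
            total = sum(counts.values())
            p_no = p_word(w12, counts, total, n_units)
            p1 = p_word(w1, counts, total, n_units)
            counts[w1] += 1             # second word conditions on the first
            p_yes = p1 * p_word(w2, counts, total + 1, n_units)
            if rng.random() < p_yes / (p_yes + p_no):
                bounds.add(i)
            else:
                bounds.discard(i)
    return seg_words(units, bounds)

# Characters stand in for discovered subword-unit labels in this toy example.
tokens = tuple("lookatthedoggielookatthekitty")
print(["".join(w) for w in gibbs_segment(tokens)])
```

On real zero resource output, the characters above would be replaced by automatically discovered unit labels, and the resulting segmentation would be scored against word boundaries obtained from a forced alignment.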
This paper addresses automatic skill assessment in robotic minimally invasive surgery. Hidden Markov models (HMMs) are developed for the individual surgical gestures (or surgemes) that comprise a typical bench-top surgical training task. It is known that such HMMs can be used to recognize and segment surgemes in previously unseen trials [1]. Here, the topology of each surgeme HMM is designed in a data-driven manner, mixing trials from multiple surgeons with varying skill levels, so that the resulting HMM states model skill-specific sub-gestures. The sequence of HMM states visited while performing a surgeme is therefore indicative of the surgeon's skill level. This expectation is confirmed by the average edit distance between the state-level "transcripts" of the same surgeme performed by two surgeons with different expertise levels. Some surgemes are further shown to be more indicative of skill than others.
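To make the comparison concrete, here is a minimal sketch of the kind of state-level transcript comparison described above: a standard Levenshtein edit distance between two Viterbi state sequences. The state labels, sequences, and normalization are hypothetical stand-ins, not the paper's actual data or exact scoring procedure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of HMM state labels."""
    prev = list(range(len(b) + 1))            # distances from the empty prefix
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x -> y
        prev = curr
    return prev[-1]

# Viterbi state "transcripts" of the same surgeme from two surgeons (hypothetical).
expert = ["s1", "s2", "s2", "s5"]
novice = ["s1", "s3", "s3", "s4", "s5"]
d = edit_distance(expert, novice)
print(d / max(len(expert), len(novice)))  # length-normalized dissimilarity: 0.6
```

Averaging this distance over many trial pairs is what separates surgeons of different expertise levels: transcripts from surgeons of similar skill should sit closer together than transcripts from surgeons of differing skill.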
Accurate unsupervised learning of the phonemes of a language directly from speech is demonstrated via an algorithm for joint unsupervised learning of the topology and parameters of a hidden Markov model (HMM); states and short state sequences through this HMM correspond to the learnt sub-word units. The algorithm, originally proposed for unsupervised learning of allophonic variations within a given phoneme set, has been adapted to learn without any knowledge of the phonemes. An evaluation methodology is also proposed, whereby the state sequence that aligns to a test utterance is automatically transduced into a phoneme sequence and compared to the utterance's manual transcription. Over 85% phoneme recognition accuracy is demonstrated for speaker-dependent learning from fluent, large-vocabulary speech.
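One plausible realization of the proposed transduction step is sketched below under assumed names: each learned HMM state is mapped many-to-one to the reference phoneme it most often co-occurs with in frame-aligned training data, and the mapped test output is then scored against the manual transcription. The function names, state labels, and toy data are illustrative, not the paper's implementation.

```python
from collections import Counter, defaultdict

def learn_state_to_phoneme_map(frame_pairs):
    """frame_pairs: (learned_state, reference_phoneme) tuples, one per frame."""
    cooc = defaultdict(Counter)
    for state, phoneme in frame_pairs:
        cooc[state][phoneme] += 1
    # Many-to-one map: each state goes to its majority co-occurring phoneme.
    return {state: c.most_common(1)[0][0] for state, c in cooc.items()}

def frame_accuracy(frame_pairs, mapping):
    """Score transduced states against the manual transcription, frame by frame."""
    correct = sum(mapping.get(state) == ref for state, ref in frame_pairs)
    return correct / len(frame_pairs)

# Hypothetical frame-aligned data: q* states discovered by the HMM, phonemes
# taken from a manual transcription.
train = [("q7", "aa"), ("q7", "aa"), ("q7", "ao"), ("q3", "t")]
test = [("q7", "aa"), ("q3", "t"), ("q3", "d")]
m = learn_state_to_phoneme_map(train)   # {'q7': 'aa', 'q3': 't'}
print(frame_accuracy(test, m))          # 2 of 3 test frames transduced correctly
```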