Abstract-We demonstrate the utility of speech prosody as a feedback mechanism in a machine learning system. We have constructed a reinforcement learning system for our humanoid robot Nico, which uses prosodic feedback to refine the parameters of a social waving behavior. We define a waving behavior to be an oscillation of Nico's elbow joint, parameterized by amplitude and frequency. Our system explores a space of amplitude and frequency values, using Q-learning to learn the wave that optimally satisfies a human tutor. To estimate tutor feedback in real time, we first segment speech from ambient noise using a maximum-likelihood voice-activation detector. We then use a k-nearest-neighbors classifier, with k=3, over 15 prosodic features to estimate a binary approval/disapproval feedback signal from segmented utterances. Both our voice-activation detector and prosody classifier are trained on the speech of the individual tutor. We show that our system learns the tutor's desired wave over the course of a sequence of trial-feedback cycles. We demonstrate our learning results for a single speaker on a space of nine distinct waving behaviors.

Index Terms: speech prosody, human-robot interaction, reinforcement learning, socially-guided machine learning.
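The trial-feedback loop described in the abstract can be sketched as single-state Q-learning over the nine candidate waves, with the prosody classifier's binary approval/disapproval signal as reward. The amplitude and frequency values, the learning parameters, and the `simulated_tutor` stand-in below are all illustrative assumptions, not values from the paper:

```python
import random

# Hypothetical sketch: Q-learning over a 3x3 grid of (amplitude, frequency)
# waving behaviors, with a binary approval (+1) / disapproval (-1) signal.
# The tutor model below is a stand-in for the real prosody classifier.

AMPLITUDES = [0.3, 0.6, 0.9]   # assumed normalized elbow amplitudes
FREQUENCIES = [0.5, 1.0, 1.5]  # assumed oscillation frequencies (Hz)
ACTIONS = [(a, f) for a in AMPLITUDES for f in FREQUENCIES]  # nine waves

def simulated_tutor(action, desired=(0.6, 1.0)):
    """Stand-in for prosodic feedback: approve only the desired wave."""
    return 1.0 if action == desired else -1.0

def learn_wave(episodes=200, alpha=0.1, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {a: 0.0 for a in ACTIONS}
    for _ in range(episodes):
        # epsilon-greedy exploration over the nine candidate waves
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(q, key=q.get)
        reward = simulated_tutor(action)
        # single-state Q-learning update (reduces to a bandit update here)
        q[action] += alpha * (reward - q[action])
    return max(q, key=q.get)

print(learn_wave())  # converges to the simulated tutor's desired wave
```

Because there is only one state and an immediate terminal reward, the update has no bootstrapped next-state term; the greedy policy settles on the desired wave once every other action has been tried and penalized.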
Abstract-We examined whether evidence for prosodic signals about shared belief can be quantitatively found within the acoustic signal of infant-directed speech. Two transcripts of infant-directed speech for infants aged 1;4 and 1;6 were labeled with distinct speaker intents to modify shared beliefs, based on Pierrehumbert and Hirschberg's theory of the meaning of prosody [1]. Acoustic predictions were made from intent labels first within a simple single-tone model that reflected only whether the speaker intended to add a word's information to the discourse (high tone, H*) or not (low tone, L*). We also predicted pitch within a more complicated five-category model that added intents to suggest a word as one of several possible alternatives (L*+H), a contrasting alternative (L+H*), or something about which the listener should make an inference (H*+L). The acoustic signal was then manually segmented and automatically classified based solely on whether the pitches at the beginning, end, and peak intensity points of stressed syllables in salient words were closer to the utterance's pitch minimum or maximum on a log scale.
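The classification rule at the end of the abstract reduces to a proximity test on a log-frequency scale. A minimal sketch of that rule, assuming pitch values in Hz (the function name and example values are illustrative, not from the paper):

```python
import math

# Sketch of the tone classification rule: a pitch sample is labeled H*
# if it is closer to the utterance's pitch maximum than to its minimum
# on a log scale, and L* otherwise.

def classify_tone(pitch_hz, utt_min_hz, utt_max_hz):
    """Return 'H*' or 'L*' by log-scale proximity to the utterance extremes."""
    p = math.log(pitch_hz)
    lo, hi = math.log(utt_min_hz), math.log(utt_max_hz)
    return 'H*' if (hi - p) < (p - lo) else 'L*'

# Example: a 300 Hz sample in an utterance spanning 150-400 Hz
print(classify_tone(300.0, 150.0, 400.0))  # prints "H*"
```

Working in log frequency makes the midpoint of the utterance's pitch range perceptually, rather than linearly, centered, which matters because pitch intervals are heard proportionally.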