[abstFig src='/00290001/12.jpg' width='300' text='An overview of real-time audio-visual beat-tracking for music audio signals and human dance moves' ] This paper presents a real-time audio-visual beat-tracking method for an entertainment robot that can dance in synchronization with music and human dancers. Most conventional music robots have focused on either music audio signals or the movements of human dancers to detect and predict beat times in real time. Since a robot needs to record music audio signals with its own microphones, however, the signals are severely contaminated with loud environmental noise and reverberation. Moreover, it is difficult to visually detect beat times from real, complicated dancing movements, which exhibit weaker repetitive structure than music audio signals do. To solve these problems, we propose a state-space model that encodes a pair of a tempo and a beat time and represents how acoustic and visual features are generated from a given state in a probabilistic manner. At each frame, the method extracts acoustic features (tempo and onset likelihoods) from music audio signals and visual features (tempo likelihoods) from the skeleton movements of a human dancer. The current tempo and the next beat time are then estimated in an online manner from the history of observed features by using a particle filter. Experimental results show that the proposed multi-modal method, which uses a depth sensor (Kinect) to extract skeleton features, outperformed conventional mono-modal methods by 0.20 in F-measure in terms of beat-tracking accuracy in a noisy and reverberant environment.
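As a rough illustration of the particle-filter fusion described in the abstract, the sketch below (ours, not the authors' implementation) maintains particles over (tempo, next beat time) hypotheses and reweights them with audio and visual likelihood functions at each frame. The function names, parameter values, and the random-walk transition model are all assumptions made for illustration.

```python
import numpy as np

# Minimal particle-filter sketch, assuming vectorized likelihood callables
# supplied by hypothetical feature extractors (not defined in the paper text).
N_PARTICLES = 500
rng = np.random.default_rng(0)

# Each particle: [tempo in BPM, next beat time in seconds].
particles = np.column_stack([
    rng.uniform(60.0, 180.0, N_PARTICLES),   # tempo hypotheses
    rng.uniform(0.0, 1.0, N_PARTICLES),      # next-beat-time hypotheses
])
weights = np.full(N_PARTICLES, 1.0 / N_PARTICLES)


def step(frame_time, audio_tempo_lik, visual_tempo_lik, audio_onset_lik):
    """Advance the filter by one frame.

    audio_tempo_lik, visual_tempo_lik: callables mapping an array of tempos
        (BPM) to likelihoods; audio_onset_lik: callable mapping an array of
        beat times (s) to likelihoods. All three are placeholders for the
        acoustic and visual feature likelihoods described in the abstract.
    """
    global particles, weights

    # 1. Transition: perturb tempos with a small random walk and advance any
    #    beat-time hypothesis that has already passed by one beat period.
    particles[:, 0] += rng.normal(0.0, 2.0, N_PARTICLES)
    period = 60.0 / particles[:, 0]
    passed = particles[:, 1] < frame_time
    particles[passed, 1] += period[passed]

    # 2. Reweight: multiply audio and visual likelihoods (probabilistic fusion).
    weights *= (audio_tempo_lik(particles[:, 0])
                * visual_tempo_lik(particles[:, 0])
                * audio_onset_lik(particles[:, 1]))
    weights /= weights.sum()

    # 3. Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < N_PARTICLES / 2:
        idx = rng.choice(N_PARTICLES, N_PARTICLES, p=weights)
        particles = particles[idx]
        weights = np.full(N_PARTICLES, 1.0 / N_PARTICLES)

    # 4. Point estimates: weighted means of tempo and next beat time.
    tempo_est = float(np.average(particles[:, 0], weights=weights))
    next_beat_est = float(np.average(particles[:, 1], weights=weights))
    return tempo_est, next_beat_est
```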