This paper presents an audio-visual beat-tracking method for an entertainment robot that can dance in synchronization with music and human dancers. Conventional music robots have focused on either music audio signals or the dancing movements of humans for detecting and predicting beat times in real time. Since a robot needs to record music audio signals with its own microphones, however, the signals are severely contaminated with loud environmental noise and reverberant sounds. Moreover, it is difficult to visually detect beat times from real, complicated dancing movements, which exhibit weaker repetitive characteristics than music audio signals do. To solve these problems, we propose a state-space model that integrates audio and visual information in a probabilistic manner. At each frame, the method extracts acoustic features (audio tempos and onset likelihoods) from music audio signals and skeleton features from the movements of a human dancer. The current tempo and the next beat time are then estimated from these observed features using a particle filter. Experimental results showed that the proposed multi-modal method, which uses a depth sensor (Kinect) for extracting skeleton features, outperformed conventional mono-modal methods by 0.20 (F-measure) in beat-tracking accuracy in a noisy and reverberant environment.
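A rough illustration of the particle-filter stage is sketched below. This is not the authors' implementation: the frame rate, particle count, noise levels, and Gaussian-shaped observation model are assumptions made only for the example. Each particle carries a (tempo, beat-phase) hypothesis that is re-weighted by how well it explains the audio onset likelihood and a visual movement likelihood at the current frame.

```python
# Minimal particle-filter sketch for audio-visual beat tracking.
# NOT the paper's implementation: FPS, particle count, noise levels,
# and the observation model below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
FPS = 100            # feature frames per second (assumed)
N_PARTICLES = 500    # number of hypotheses (assumed)

# Each particle is a hypothesis: tempo in BPM and beat phase in [0, 1).
tempo = rng.uniform(60, 180, N_PARTICLES)
phase = rng.uniform(0, 1, N_PARTICLES)
weights = np.full(N_PARTICLES, 1.0 / N_PARTICLES)

def step(audio_onset_likelihood, visual_onset_likelihood):
    """Advance one frame and fuse both modalities; inputs are scalars in [0, 1]."""
    global tempo, phase, weights
    # Transition: tempo drifts slowly; phase advances according to tempo.
    tempo = np.clip(tempo + rng.normal(0, 0.2, N_PARTICLES), 60, 180)
    phase = (phase + tempo / 60.0 / FPS) % 1.0
    # Observation: particles whose phase is near a beat (phase ~ 0) are
    # rewarded when either modality reports a likely onset at this frame.
    near_beat = np.exp(-0.5 * (np.minimum(phase, 1.0 - phase) / 0.05) ** 2)
    weights *= 1e-3 + near_beat * (audio_onset_likelihood + visual_onset_likelihood)
    weights /= weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < N_PARTICLES / 2:
        idx = rng.choice(N_PARTICLES, N_PARTICLES, p=weights)
        tempo, phase = tempo[idx], phase[idx]
        weights[:] = 1.0 / N_PARTICLES
    est_tempo = float(np.sum(weights * tempo))
    next_beat_sec = float(np.sum(weights * (1.0 - phase))) * 60.0 / est_tempo
    return est_tempo, next_beat_sec
```

In this sketch the two per-frame inputs would correspond to the onset likelihood extracted from the noisy microphone signal and to a periodicity cue derived from the Kinect skeleton features, respectively.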
A system for transferring vocal expressions separately from singing voices with accompaniment to singing voice synthesizers is described. The expressions appear as fluctuations in the fundamental frequency contour of the singing voice, such as vibrato, glissando, and kobushi. The fundamental frequency contour of the singing voice is estimated using subharmonic summation in a limited frequency range and is temporally aligned to a chromatic pitch sequence. Each expression is then transcribed and parameterized in accordance with designed rules. Finally, the expressions are transferred to given scores on the singing voice synthesizer. Experiments demonstrated that the proposed system can transfer the vocal expressions while retaining the singer's individuality on two singing voice synthesizers: Vocaloid and CeVIO.
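The subharmonic-summation step can be sketched as follows; the 1 Hz candidate grid, harmonic count, and decay weight are illustrative assumptions, and restricting the candidate range to the singing-voice register stands in for the limited frequency range mentioned above.

```python
# Sketch of subharmonic summation (SHS) for F0 estimation on one frame.
# Parameters (grid, harmonic count, decay) are assumptions for illustration.
import numpy as np

def shs_f0(frame, sr, fmin=80.0, fmax=1000.0, n_harmonics=8, decay=0.84):
    """Return the F0 candidate (Hz) whose weighted harmonic sum is largest."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    candidates = np.arange(fmin, fmax, 1.0)      # 1 Hz candidate grid (assumed)
    scores = np.zeros(len(candidates))
    for h in range(1, n_harmonics + 1):
        # Sample the magnitude spectrum at each candidate's h-th harmonic
        # and accumulate it with an exponentially decaying weight.
        scores += decay ** (h - 1) * np.interp(
            candidates * h, freqs, spec, left=0.0, right=0.0)
    return candidates[np.argmax(scores)]
```

Running such an estimator per analysis frame yields the raw contour that would then be temporally aligned to the chromatic pitch sequence before the rule-based transcription of vibrato, glissando, and kobushi.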
A method for transcribing vocal expressions such as vibrato, glissando, and kobushi separately from polyphonic music is described. The expressions appear as fluctuations in the fundamental frequency contour of the singing voice. They can be used for music search and retrieval and for expressive singing voice synthesis based on singing style, since they strongly reflect the individuality of the singer. The fundamental frequency contour of the singing voice is estimated using the Viterbi algorithm constrained by a corresponding note sequence. Next, the notes are temporally aligned with the fundamental frequency sequence. Finally, each expression is identified and parameterized in accordance with designed rules. Experiments demonstrated that this method can transcribe expressions in the singing voice from commercial recordings.
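A minimal sketch of note-constrained Viterbi tracking in the spirit of this step is given below; the per-frame salience matrix, semitone-bin state space, and Gaussian-style transition and note penalties are assumptions made for illustration rather than the method's actual model.

```python
# Sketch of Viterbi F0 tracking constrained by a corresponding note sequence.
# The salience input and both penalty terms are illustrative assumptions.
import numpy as np

def viterbi_f0(salience, note_bin, trans_sigma=1.0, note_sigma=2.0):
    """salience: (T, K) per-frame pitch salience over K semitone bins.
    note_bin: (T,) score note per frame, expressed in the same bin index.
    Returns the most likely bin index per frame."""
    T, K = salience.shape
    bins = np.arange(K)
    # Emission: log-salience plus a penalty for straying from the score note.
    emit = (np.log(salience + 1e-9)
            - 0.5 * ((bins[None, :] - note_bin[:, None]) / note_sigma) ** 2)
    # Transition: penalize large frame-to-frame pitch jumps.
    trans = -0.5 * ((bins[:, None] - bins[None, :]) / trans_sigma) ** 2
    delta = emit[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans          # scores[i, j]: from bin i to j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], bins] + emit[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

Here the note penalty plays the role of the constraint from the corresponding note sequence: the decoded contour stays near the score while remaining free to follow vibrato, glissando, and kobushi deviations.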