We present a novel generative model for audio event transcription that recognizes "events" in audio signals containing multiple kinds of overlapping sounds. In the proposed model, the overlapping audio events are first modeled by nonnegative matrix factorization into which two Bayesian nonparametric approaches, the Markov Indian buffet process and the Chinese restaurant process, are incorporated. By assuming a countably infinite number of possible audio events in the input signal, this approach lets us transcribe the events automatically while avoiding the model selection problem. Bayesian logistic regression then annotates the audio frames with multiple event labels in a semi-supervised learning setup. Experimental results show that our model annotates an audio signal more accurately than a baseline method. Additionally, we verify that our infinite generative model can also detect unknown audio events that are not included in the training data.
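As an illustration of the factorization step, the sketch below applies standard NMF to a toy spectrogram to obtain per-frame event activations. The paper's actual model replaces the fixed component count with a Markov Indian buffet process prior, so N_EVENTS, the toy data, and the thresholding rule here are purely illustrative assumptions.

```python
# Minimal sketch: frame-wise event activations via standard NMF, standing in
# for the paper's nonparametric (Markov-IBP-based) factorization.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Toy magnitude spectrogram: (frequency bins x time frames), nonnegative.
V = np.abs(rng.normal(size=(257, 400)))

N_EVENTS = 8  # fixed here; the paper infers this count nonparametrically
model = NMF(n_components=N_EVENTS, init="nndsvda", max_iter=400)
W = model.fit_transform(V)   # spectral bases, one per putative event
H = model.components_        # per-frame activations of each event

# Crude thresholding gives a binary event/no-event decision per frame;
# the paper instead couples activations to labels via Bayesian logistic
# regression in a semi-supervised setup.
active = H > H.mean(axis=1, keepdims=True)
print(active.shape)  # (8, 400): events x frames
```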
We propose a music segment detection method for audio signals. Unlike many existing methods, ours specifically targets background-music detection, that is, detecting music played behind the main sounds. This task is important because in audiovisual materials such as TV programs, music is almost always overlapped by speech or other environmental sounds. Our method consists of feature extraction, dimension reduction, and statistical discrimination steps. For each step, we compared a set of candidate methods to maximize detection accuracy. With a simple post-processing step, we achieved a frame-wise error rate as low as 8% even when the mixed speech was louder than the target music by 10 dB.
Index Terms: background music detection, Gaussian mixture model, k-nearest neighbor method, feature selection
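As a hedged sketch of the statistical discrimination step, the snippet below trains two Gaussian mixture models (music-present vs. music-absent) on toy frame features and smooths the frame-wise decisions with a median filter as a stand-in post-processing pass. The feature vectors, component counts, and filter length are assumptions, not the paper's settings.

```python
# Minimal sketch: frame-wise GMM likelihood-ratio classification plus
# simple smoothing, standing in for the paper's discrimination and
# post-processing steps.
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.ndimage import median_filter

rng = np.random.default_rng(1)
# Toy per-frame feature vectors for each class (e.g., MFCCs in practice).
X_music = rng.normal(loc=1.0, size=(500, 13))
X_other = rng.normal(loc=-1.0, size=(500, 13))

gmm_music = GaussianMixture(n_components=4, random_state=0).fit(X_music)
gmm_other = GaussianMixture(n_components=4, random_state=0).fit(X_other)

# Frame-wise decision on a test stream: pick the class with the higher
# log-likelihood per frame.
X_test = rng.normal(loc=1.0, size=(200, 13))
decision = gmm_music.score_samples(X_test) > gmm_other.score_samples(X_test)

# Post-processing: a median filter removes spurious single-frame flips.
smoothed = median_filter(decision.astype(int), size=9)
print(smoothed.mean())  # fraction of frames labeled "music"
```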
We propose a method for musical audio search based on signal matching. A major problem in the signal matching approach to musical audio search has been key variation: if the key of a query signal differs significantly from that of the stored database, the search will fail. To cope with this problem, our method employs self-similarity as the feature for signal matching. The self-similarity proposed here is the similarity of the power spectra at two time points within an audio signal. We show that the method increases the robustness of musical audio search with respect to key variation. In our experiments, for example, the proposed method yields precision and recall rates of around 0.75 even when the pitches in the queries and the stored signals differ by seven semitones, whereas a conventional signal matching method produces no meaningful results in such a case.
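Below is a minimal sketch of the self-similarity feature, assuming an STFT front end: each pair of frames is compared via the cosine similarity of their power spectra, yielding a matrix whose pattern is largely preserved under a key shift, since transposition moves every frame's spectrum consistently. The STFT parameters and toy signal are illustrative assumptions.

```python
# Minimal sketch: self-similarity matrix of power-spectrum frames,
# the key-robust matching feature described in the abstract.
import numpy as np
from scipy.signal import stft

sr = 22050
t = np.arange(sr * 2) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

_, _, Z = stft(signal, fs=sr, nperseg=1024)
P = np.abs(Z) ** 2                     # power spectrogram: (freq, frames)

# Normalize each frame, then take all pairwise dot products
# (i.e., cosine similarity between every pair of time points).
Pn = P / (np.linalg.norm(P, axis=0, keepdims=True) + 1e-12)
S = Pn.T @ Pn                          # self-similarity matrix: (frames, frames)
print(S.shape)

# Matching would then compare S of a query against S of each stored signal,
# e.g., by correlating the two matrices over a sliding window.
```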