We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.8 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
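For reference, the SI-SNR and SI-SNRi figures quoted above can be computed as in the following minimal numpy sketch; the function names and the small eps terms are illustrative assumptions rather than the FUSS baseline implementation.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB.

    Both signals are zero-meaned, and the reference is rescaled by the optimal
    scalar projection so that the metric is invariant to the estimate's gain.
    """
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference   # component of the estimate aligned with the reference
    noise = estimate - target    # everything else counts as error
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: SI-SNR of the separated estimate minus SI-SNR of the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```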
Music source separation performance has greatly improved in recent years with the advent of approaches based on deep learning. Such methods typically require large amounts of labelled training data, which in the case of music consist of mixtures and corresponding instrument stems. However, stems are unavailable for most commercial music, and only limited datasets have so far been released to the public. It can thus be difficult to draw conclusions when comparing various source separation methods, as the difference in performance may stem as much from better data augmentation techniques or training tricks to alleviate the limited availability of training data, as from intrinsically better model architectures and objective functions. In this paper, we present the synthesized Lakh dataset (Slakh) as a new tool for music source separation research. Slakh consists of high-quality renderings of instrumental mixtures and corresponding stems generated from the Lakh MIDI dataset (LMD) using professional-grade sample-based virtual instruments. A first version, Slakh2100, focuses on 2100 songs, resulting in 145 hours of mixtures. While not fully comparable because it is purely instrumental, this dataset contains an order of magnitude more data than MUSDB18, the de facto standard dataset in the field. We show that Slakh can be used to effectively augment existing datasets for musical instrument separation, while opening the door to a wide array of data-intensive music signal analysis tasks.
Index Terms: music source separation, sample-based virtual instruments, synthesis, MIDI
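As one illustration of the augmentation use case mentioned above, the following hypothetical sketch recombines single-instrument stems (for example, Slakh renderings stored as audio files) into fresh training mixtures with random gains; the file layout, the soundfile-based loading, and the gain range are assumptions, not part of the released dataset tools.

```python
import random
import numpy as np
import soundfile as sf  # assumption: stems are stored as audio files readable by soundfile

def random_instrument_mixture(stem_paths, num_stems=4, gain_db_range=(-6.0, 0.0)):
    """Sum randomly chosen, randomly gained stems into a new training mixture.

    stem_paths: paths to single-instrument stems, assumed mono at a common sample rate.
    Returns (mixture, stems) so the individual stems can serve as separation targets.
    """
    chosen = random.sample(stem_paths, num_stems)
    stems = []
    for path in chosen:
        audio, _ = sf.read(path, dtype="float32")
        gain = 10.0 ** (random.uniform(*gain_db_range) / 20.0)  # random gain in dB
        stems.append(gain * audio)
    length = min(len(s) for s in stems)                         # trim to the shortest stem
    stems = np.stack([s[:length] for s in stems])
    return stems.sum(axis=0), stems
```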
Audio production is central to every kind of media that involves sound, such as film, television, and music, and involves transforming audio into a state ready for consumption by the public. One of the most commonly used audio production tools is the reverberator. Current interfaces are often complex and hard to understand. We seek to simplify these interfaces by letting users communicate their audio production objective with descriptive language (e.g. "Make the drums sound bigger."). To achieve this goal, a system must be able to tell whether the stated goal is appropriate for the selected tool (e.g. making the violin warmer using a panning tool does not make sense). If the goal is appropriate for the tool, it must know what actions lead to the goal. Further, the tool should not impose a vocabulary on users, but rather understand the vocabulary users prefer. In this work, we describe SocialReverb, a project to crowdsource a vocabulary of audio descriptors that can be mapped onto concrete actions using a parametric reverberator. We deployed SocialReverb on Mechanical Turk, where 513 unique users described 256 instances of reverberation using 2861 unique words. We used this data to build a concept map showing which words are popular descriptors, which ones map consistently to specific reverberation types, and which ones are synonyms. This promises to enable future interfaces that let the user communicate their production needs using natural language.
Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. Source separation systems based on deep learning are currently the most successful approaches for solving the underdetermined separation problem, where there are more sources than channels. Traditionally, such systems are trained on sound mixtures where the ground truth decomposition is already known. Since most real-world recordings do not have such a decomposition available, this limits the range of mixtures one can train on, and the range of mixtures the learned models may successfully separate. In this work, we use a simple blind spatial source separation algorithm to generate estimated decompositions of stereo mixtures. These estimates, together with a time-frequency weighting scheme based on confidence in the separation quality, are used to train a deep learning model for single-channel separation, where no source direction information is available. This demonstrates how a simple cue such as the direction of origin of a source can be used to bootstrap a separation model for situations where that cue is not available.
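The abstract does not spell out the exact objective, but a confidence-weighted training loss of the kind it describes might look like the following PyTorch sketch; the tensor shapes, the use of L1 on time-frequency masks, and the multiplication by the mixture magnitude are all assumptions made for illustration.

```python
import torch

def confidence_weighted_mask_loss(pred_masks, bootstrap_masks, confidence, mix_mag):
    """Weighted L1 loss between predicted masks and masks estimated by blind spatial separation.

    pred_masks, bootstrap_masks: (batch, sources, freq, time) time-frequency masks.
    confidence: (batch, freq, time) weights in [0, 1], high where the spatial
        separation of the stereo mixture is trusted.
    mix_mag: (batch, freq, time) mixture magnitude, emphasizing energetic bins.
    """
    weight = (confidence * mix_mag).unsqueeze(1)  # broadcast the weight over the source axis
    return (weight * (pred_masks - bootstrap_masks).abs()).mean()
```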
We present a single deep learning architecture that simultaneously separates an audio recording of a musical mixture into constituent single-instrument recordings and transcribes these instruments into a human-readable format, learning a shared musical representation for both tasks. This novel architecture, which we call Cerberus, builds on the Chimera network for source separation by adding a third "head" for transcription. By training each head with different losses, we are able to jointly learn how to separate and transcribe up to five instruments with a single network. We show that separation and transcription are highly complementary: when learned jointly, they lead to Cerberus networks that are better at both separation and transcription and that generalize better to unseen mixtures.
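As a rough structural illustration (not the published configuration), a Chimera-style body with a third head could be sketched in PyTorch as below; the BLSTM body, layer sizes, and output parameterizations are assumptions.

```python
import torch
import torch.nn as nn

class CerberusSketch(nn.Module):
    """Shared BLSTM body feeding three heads: time-frequency masks (separation),
    per-bin embeddings (deep-clustering style, as in Chimera), and per-source
    pitch activations (transcription)."""

    def __init__(self, n_freq=257, n_src=5, emb_dim=20, n_pitches=88, hidden=300):
        super().__init__()
        self.n_freq, self.n_src = n_freq, n_src
        self.emb_dim, self.n_pitches = emb_dim, n_pitches
        self.body = nn.LSTM(n_freq, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden, n_src * n_freq)
        self.embed_head = nn.Linear(2 * hidden, n_freq * emb_dim)
        self.trans_head = nn.Linear(2 * hidden, n_src * n_pitches)

    def forward(self, mix_mag):                      # mix_mag: (batch, time, freq)
        h, _ = self.body(mix_mag)
        b, t, _ = h.shape
        masks = torch.sigmoid(self.mask_head(h)).view(b, t, self.n_src, self.n_freq)
        emb = torch.tanh(self.embed_head(h)).view(b, t, self.n_freq, self.emb_dim)
        rolls = torch.sigmoid(self.trans_head(h)).view(b, t, self.n_src, self.n_pitches)
        return masks, emb, rolls
```

Each head can then be trained with its own loss (for example a mask-inference loss, a deep clustering loss, and a binary cross-entropy transcription loss), summed with per-head weights.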