A Style Transfer Approach to Source Separation

Venkataramani, Shrikant; Tzinis, Efthymios; Smaragdis, Paris

doi:10.1109/waspaa.2019.8937203

Cited by 3 publications

(2 citation statements)

References 25 publications

“…In music, the most useful application is that of separating the lead vocals from a musical mixture. This problem is well researched and numerous deep learning based models have recently been proposed to tackle it [4,5,6,7,8,9,10,11]. Most of these models use the neural network to predict soft time frequency masks, given an input magnitude spectrogram of the mixture signal.…”

Section: Introduction (mentioning)

confidence: 99%

Content Based Singing Voice Extraction from a Musical Mixture

Chandna, Blaauw, Bonada

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We present a deep learning based methodology for extracting the singing voice signal from a musical mixture based on the underlying linguistic content. Our model follows an encoder-decoder architecture and takes as input the magnitude component of the spectrogram of a musical mixture with vocals. The encoder part of the model is trained via knowledge distillation using a teacher network to learn a content embedding, which is decoded to generate the corresponding vocoder features. Using this methodology, we are able to extract the unprocessed raw vocal signal from the mixture even for a processed mixture dataset with singers not seen during training. While the nature of our system makes it incongruous with traditional objective evaluation metrics, we use subjective evaluation via listening tests to compare the methodology to state-of-the-art deep learning based source separation algorithms. We also provide sound examples and source code for reproducibility.
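
The soft time-frequency masking mentioned in the citation statement above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `soft_mask`, the random toy spectrograms, the additive-magnitude mixing model, and the ratio-mask formula computed from known sources. In the cited models a neural network predicts the mask from the mixture spectrogram alone.

```python
import numpy as np

def soft_mask(vocal_mag, accomp_mag, eps=1e-8):
    """Ratio mask in [0, 1]: share of each time-frequency bin attributed to the vocals."""
    return vocal_mag / (vocal_mag + accomp_mag + eps)

# Toy magnitude spectrograms (frequency bins x time frames); hypothetical data.
rng = np.random.default_rng(0)
vocal_mag = rng.random((513, 100))
accomp_mag = rng.random((513, 100))

# Simplifying assumption: source magnitudes add up to the mixture magnitude.
mixture_mag = vocal_mag + accomp_mag

# Apply the mask element-wise to the mixture to estimate the vocal magnitude.
mask = soft_mask(vocal_mag, accomp_mag)
vocal_estimate = mask * mixture_mag
```

Because the mask takes values between 0 and 1 rather than a hard 0/1 decision, each bin of the mixture is shared between the sources in proportion to their estimated energy.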

“…To relax the constraints of paired training data, a few recent approaches interpret the problem of denoising and source separation as a style-transfer problem wherein, the goal is to map from the domain of noisy mixtures to the domain of clean sounds (Stoller et al, 2018; Michelashvili et al, 2019; Venkataramani et al, 2019). These approaches only require a training set of mixtures and a training set of clean sounds, but the clean sounds can be unpaired and unrelated to the mixtures.…”

Section: Introduction (mentioning)

confidence: 99%

Self-supervised Learning for Speech Enhancement

Wang, Venkataramani, Smaragdis

2020

Preprint

Self Cite

Supervised learning for single-channel speech enhancement requires carefully labeled training examples where the noisy mixture is input into the network and the network is trained to produce an output close to the ideal target. To relax the conditions on the training data, we consider the task of training speech enhancement networks in a self-supervised manner. We first use a limited training set of clean speech sounds and learn a latent representation by autoencoding on their magnitude spectrograms. We then autoencode on speech mixtures recorded in noisy environments and train the resulting autoencoder to share a latent representation with the clean examples. We show that using this training schema, we can now map noisy speech to its clean version using a network that is autonomously trainable without requiring labeled training examples or human intervention.
