Improving Universal Sound Separation Using Sound Classification

Tzinis, Efthymios; Wisdom, Scott; Hershey, John R.; Jansen, Aren; Ellis, Daniel P. W.

doi:10.1109/icassp40776.2020.9053921

Cited by 61 publications

(48 citation statements)

References 22 publications


“…Tzinis et al [4] performed separation experiments with a fixed number of sources on the 50-class ESC-50 dataset [5]. Other papers have leveraged information about sound class, either as conditioning information or as a weak supervision signal [6,2,7].…”

Section: Relation To Prior Work (mentioning)

confidence: 99%

“…Finally, we have released a masking-based baseline separation model, based on an improved time-domain convolutional network (TDCN++), described in our recent publications [1,2]. On the FUSS test set, this model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.8 dB SI-SNR.…”

Section: Introduction (mentioning)

confidence: 99%


What’s all the Fuss about Free Universal Sound Separation Data?

Wisdom, Erdoğan, Ellis, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.8 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
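The SI-SNR and SI-SNRi figures quoted in this abstract can be illustrated with a small sketch. It assumes the common definition in which the reference is rescaled by its least-squares projection coefficient before the ratio is taken; the function names `si_snr` and `si_snr_improvement` are hypothetical, and this is not the evaluation code released with FUSS. Some implementations also subtract the mean of each signal first, which this sketch does not do.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal.

    Assumed definition: project the estimate onto the target, treat the
    projection as signal and the residual as noise.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Least-squares scale that best maps the target onto the estimate.
    alpha = dot / (target_energy + eps)
    projection = [alpha * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, target, mixture):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy signals (illustrative values only, not data from the paper).
target = [1.0, 0.0, -1.0, 0.0]
mixture = [1.0, 1.0, -1.0, 1.0]    # target plus a strong interferer
estimate = [1.0, 0.1, -1.0, 0.1]   # target plus a weak residual
improvement_db = si_snr_improvement(estimate, target, mixture)
```

Because the reference is rescaled to fit the estimate, multiplying the estimate by any nonzero constant leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.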


“…Kavalerov et al [1] have recently shown that by constructing a suitable dataset of mixtures of sounds, one can obtain SI-SDR (scale-invariant SDR) improvements of up to 10 dB. These results can be further improved by conditioning the sound separation model on the embeddings computed by a pre-trained audio classifier, which can also be fine-tuned [9]. More recently, a completely unsupervised approach to universal sound separation was developed [10], which does not require any single-source ground truth data, but instead can be trained using only real-world audio mixtures.…”

Section: Related Work (mentioning)

confidence: 99%

One-Shot Conditional Audio Filtering of Arbitrary Sounds

Gfeller, Roblek, Tagliasacchi 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source (from the same recording). Using SoundFilter, a wave-to-wave neural network architecture, we can train a model without using any sound class labels. Using a conditioning encoder model which is learned jointly with the source separation network, the trained model can be "configured" to filter arbitrary sound sources, even ones that it has not seen during training. Evaluated on the FSD50k dataset, our model obtains an SI-SDR improvement of 9.6 dB for mixtures of two sounds. When trained on Librispeech, our model achieves an SI-SDR improvement of 14.0 dB when separating one voice from a mixture of two speakers. Moreover, we show that the representation learned by the conditioning encoder clusters acoustically similar sounds together in the embedding space, even though it is trained without using any labels.


“…The PASE loss term enforces the frame permutations to align with the best utterance permutation π*_u. Previous works also show that conditioning source separation models with additional features improves performance [25][26][27], but whether feature conditioning helps reducing permutation errors has yet to be confirmed. Towards this end, we extend the single-stage uPIT+PASE paradigm to a two-stage cascaded system.…”

Section: uPIT + PASE for Conv-TasNet (mentioning)

confidence: 99%

On Permutation Invariant Training For Speech Source Separation

Liu, Pons 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


We study permutation invariant training (PIT), which targets at the permutation ambiguity problem for speaker independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss scalable to waveform models. Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features", to reduce the local permutation errors made by the utterance level PIT (uPIT). Our results show that the proposed extensions help reducing permutation ambiguity. However, we also note that the studied STFT-based models are more effective at reducing permutation errors than waveform-based models, a perspective overlooked in recent studies.
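The idea behind utterance-level PIT mentioned in this abstract can be sketched in a few lines: evaluate the loss under every assignment of model outputs to reference sources over the whole utterance and keep the smallest. This is a minimal illustration only. The names `mse` and `upit_loss` are hypothetical, mean squared error stands in for whatever pairwise loss a real system uses, and it is not the authors' implementation (which also includes the tPIT, clustering, and PASE components not shown here).

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def upit_loss(estimates, targets, pair_loss=mse):
    """Return (minimum loss, best permutation) over all output orderings.

    best_perm[i] is the index of the estimate assigned to targets[i].
    The search is exhaustive, so cost grows factorially with source count.
    """
    n = len(targets)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # One permutation is applied to the entire utterance.
        loss = sum(pair_loss(estimates[perm[i]], targets[i]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the model emitted the two sources in swapped order.
targets = [[1.0, 1.0], [0.0, 0.0]]
estimates = [[0.0, 0.0], [1.0, 1.0]]
loss, perm = upit_loss(estimates, targets)
```

Applying a single permutation to the whole utterance is what separates uPIT from frame-level PIT, where the best assignment may change from frame to frame and a separate tracking step is needed.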

