Singing Voice Separation: A Study on Training Data

Pretet, Laure; Hennequin, Romain; Royo-Letelier, Jimena; Vaglio, Andrea

doi:10.1109/icassp.2019.8683555

Cited by 32 publications

(26 citation statements)

References 17 publications

“…To compare the separation of singing voice with state-of-the-art, we also include models that separate the mixture into four sources. It has been shown in [6] that, these four-source models have similar vocal separation performance compared to two-source models, even though the four-source separation task is more challenging than the two-source counterpart; possibly because of the additional supervision provided by different instrumental sources in the multi-task learning setup. Hence, we include the vocal SDR values of state-of-the-arts for four-source models [10,11] in our comparison.…”

Section: Comparison With Other Methods (mentioning)

confidence: 99%

“…Using the best combination of input length (10 seconds) and model size (8.3M), we experiment with different probability of applying random mixing. [6] shows that random mixing does not have a positive effect on test SDR, and one possible explanation is that it creates mixtures with somewhat independent sources. Our experiments, however, indicate that random mixing alone significantly improves the results.…”

Section: Teacher Training (mentioning)

confidence: 96%

“…However, these datasets are relatively small (all these combined are around 15 hours) and not diverse. To artificially increase the size of the dataset, [6,16,17] apply data augmentation to signal including random channel swapping, amplitude scaling, remixing sources from different songs, time-stretching, pitch shifting, and filtering. These methods, individually or combined, are empirically shown to enhance separation performance only by a limited margin [6].…”

Section: Introduction (mentioning)

confidence: 99%
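
The augmentations named in the excerpt above (random channel swapping, amplitude scaling, and remixing sources from different songs) can be illustrated with a minimal sketch. The function name `augment_and_remix`, the track representation (lists of stereo sample pairs), and the gain range are assumptions made for illustration; they are not taken from the cited papers.

```python
import random


def augment_and_remix(vocal_tracks, accomp_tracks, rng, gain_range=(0.25, 1.25)):
    """Build one training mixture from stems that may come from different songs.

    Each track is a list of (left, right) sample pairs. The gain range is an
    illustrative assumption, not a value reported in the cited work.
    """
    def augment(track):
        gain = rng.uniform(*gain_range)   # random amplitude scaling
        swap = rng.random() < 0.5         # random channel swapping
        out = []
        for left, right in track:
            left, right = gain * left, gain * right
            out.append((right, left) if swap else (left, right))
        return out

    # Stems are drawn independently, so vocals and accompaniment
    # can originate from different songs (random mixing).
    vocals = augment(rng.choice(vocal_tracks))
    accomp = augment(rng.choice(accomp_tracks))

    # Truncate to the shorter stem and sum sample by sample.
    n = min(len(vocals), len(accomp))
    mixture = [(vocals[i][0] + accomp[i][0], vocals[i][1] + accomp[i][1])
               for i in range(n)]
    return mixture, vocals[:n], accomp[:n]
```

Because the mixture is the exact sum of the augmented stems, the returned stems can serve directly as separation targets for that mixture.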

Semi-Supervised Singing Voice Separation With Noisy Self-Training

Wang

Giri

Isik

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Recent progress in singing voice separation has primarily focused on supervised deep learning methods. However, the scarcity of groundtruth data with clean musical sources has been a problem for long. Given a limited set of labeled data, we present a method to leverage a large volume of unlabeled data to improve the model's performance. Following the noisy self-training framework, we first train a teacher network on the small labeled dataset and infer pseudo-labels from the large corpus of unlabeled mixtures. Then, a larger student network is trained on combined ground-truth and self-labeled datasets. Empirical results show that the proposed self-training scheme, along with data augmentation methods, effectively leverage the large unlabeled corpus and obtain superior performance compared to supervised methods.
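
The teacher/student procedure described in this abstract can be sketched as a short loop. Everything below is a hypothetical illustration: `noisy_self_training`, `train_teacher`, `train_student`, and `add_noise` are placeholder names, and `fit_gain` is a toy scalar model standing in for a neural network so that the sketch stays runnable.

```python
def noisy_self_training(labeled, unlabeled, train_teacher, train_student,
                        add_noise=lambda pair: pair):
    """Teacher/student loop sketched from the abstract above.

    labeled:   list of (mixture, vocals) pairs with ground truth
    unlabeled: list of mixtures without ground truth
    train_*:   hypothetical callables mapping a list of pairs to a model,
               where a model is a callable mixture -> estimated vocals
    add_noise: hypothetical augmentation applied to the student's pairs
    """
    # 1. Train the teacher on the small labeled set only.
    teacher = train_teacher(labeled)
    # 2. Infer pseudo-labels for the unlabeled mixtures.
    pseudo_labeled = [(mix, teacher(mix)) for mix in unlabeled]
    # 3. Train the student on ground-truth plus self-labeled pairs.
    combined = [add_noise(pair) for pair in labeled + pseudo_labeled]
    student = train_student(combined)
    return teacher, student


def fit_gain(pairs):
    """Toy stand-in for a network: one scalar gain fitted by least squares."""
    num = sum(mix * voc for mix, voc in pairs)
    den = sum(mix * mix for mix, _ in pairs)
    gain = num / den
    return lambda mix: gain * mix
```

In the abstract the student is a larger network than the teacher; passing separate `train_teacher` and `train_student` callables leaves room for that difference.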

“…The pre-trained models are U-nets (Jansson et al, 2017) and follow similar specifications as in (Prétet, Hennequin, Royo-Letelier, & Vaglio, 2019). The U-net is an encoder/decoder Convolutional Neural Network (CNN) architecture with skip connections.…”

Section: Implementation Details (mentioning)

confidence: 99%

“…Training loss is a L1-norm between masked input mix spectrograms and source-target spectrograms. The models were trained on Deezer's internal datasets (noteworthily the Bean dataset that was used in (Prétet et al, 2019)) using Adam (Kingma & Ba, 2014). Training time took approximately a full week on a single GPU.…”

Section: Implementation Details (mentioning)

confidence: 99%
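
The training loss quoted above, an L1-norm between the masked mixture spectrogram and the target source spectrogram, can be written out as a small sketch. The function name `masked_l1_loss` and the averaging over bins are assumptions; the excerpt does not say whether the norm is summed or averaged.

```python
def masked_l1_loss(mask, mix_spec, target_spec):
    """Mean absolute error between mask * mix_spec and target_spec.

    All arguments are 2-D lists (frequency bins x time frames) of
    magnitudes. Averaging rather than summing is an assumption.
    """
    total, count = 0.0, 0
    for mask_row, mix_row, tgt_row in zip(mask, mix_spec, target_spec):
        for m, x, t in zip(mask_row, mix_row, tgt_row):
            total += abs(m * x - t)   # |masked mixture - target| per bin
            count += 1
    return total / count
```

In a U-net setup of the kind the excerpt describes, `mask` would be the network output for the given mixture, so minimising this value pushes the masked mixture towards the isolated source.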