ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053424

Improving Singing Voice Separation with the Wave-U-Net Using Minimum Hyperspherical Energy

Abstract: In recent years, deep learning has surpassed traditional approaches to the problem of singing voice separation. The Wave-U-Net is a recent deep network architecture that operates directly in the time domain. The standard Wave-U-Net is trained with data augmentation and early stopping to prevent overfitting. Minimum hyperspherical energy (MHE) regularization has recently been shown to improve generalization in image classification by encouraging a diversified filter configuration. In this work, we apply M…
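
The abstract is truncated above, but its core idea is adding an MHE term to the Wave-U-Net training loss. The snippet below is a rough PyTorch sketch of what such a penalty can look like, not the authors' implementation: the Riesz power, the `eps` constant, the restriction to Conv1d layers, and the names `wave_u_net` and `separation_loss` are illustrative assumptions.

```python
import torch
import torch.nn as nn

def hyperspherical_energy(weight: torch.Tensor, power: float = 1.0,
                          eps: float = 1e-4) -> torch.Tensor:
    """Riesz-style energy of one layer's filters after projecting them onto the unit sphere."""
    w = weight.reshape(weight.shape[0], -1)          # one row per filter
    w = w / (w.norm(dim=1, keepdim=True) + eps)      # unit-norm filters
    dist = torch.cdist(w, w)                         # pairwise Euclidean distances
    n = w.shape[0]
    off_diag = ~torch.eye(n, dtype=torch.bool, device=w.device)
    # Filters that sit close together produce large energy, so minimising it spreads them apart.
    return ((dist[off_diag] + eps) ** (-power)).sum() / (n * (n - 1))

def mhe_penalty(model: nn.Module, coeff: float = 1e-2) -> torch.Tensor:
    """Sum the per-layer energies over all 1-D convolutions (illustrative layer choice)."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv1d)]
    return coeff * sum(hyperspherical_energy(m.weight) for m in convs)

# loss = separation_loss + mhe_penalty(wave_u_net)   # hypothetical training-loss add-on
```

Minimising this energy pushes the normalised filters apart on the hypersphere, which is the "diversified filter configuration" the abstract refers to.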

Cited by 11 publications (4 citation statements) | References 17 publications
“…Cohen-Hadria et al [169] investigated data augmentation techniques such as pitch-shifting and time-scaling on publicly available smaller datasets and compared the performance of U-Net and Wave-U-Net models. In order to avoid overfitting, besides using data augmentation, Perez-Lapillo et al [170] employed an additional regularization term in the loss function, called the minimum hyperspherical energy for the Wave-U-Net architecture, where the diversity of neurons is promoted by minimizing the hyperspherical energy of the neurons in each layer. Since convolutional encoder-decoder frameworks are sensitive to the sound level of the input, Lin et al [171] showed that a combination of data augmentation, frame normalization, and zero-mean convolution makes the network sound-level invariant.…”
Section: Singing Voice Separation
mentioning, confidence: 99%
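
The last point of the quote, the zero-mean convolution of Lin et al. [171], is easy to illustrate. The sketch below is one way such a layer could be written in PyTorch and is an assumption about the general idea rather than the exact formulation in [171]: each kernel is re-centred to zero mean, so its response to a constant (DC) input segment is zero.

```python
import torch.nn as nn
import torch.nn.functional as F

class ZeroMeanConv1d(nn.Conv1d):
    """Conv1d whose kernels are re-centred to zero mean on every forward pass
    (a sketch of the zero-mean convolution idea; details in [171] may differ)."""
    def forward(self, x):
        # Subtract each filter's mean over its input channels and taps.
        w = self.weight - self.weight.mean(dim=(1, 2), keepdim=True)
        return F.conv1d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```
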
“…In this study, the 1DCNN model was trained with a batch size of 128 and a maximum of 500 epochs using the Adam optimizer [34] to minimize the multi-task loss, computed as the sum of each task-specific loss multiplied by its loss weight. In addition, a dropout rate of 0.2 and an early-stopping patience of 15 epochs (without improvement of the test loss) [35] were used to prevent overfitting.…”
Section: One-Dimensional Convolutional Neural Network (1DCNN)
mentioning, confidence: 99%
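
The quoted passage pins down the training recipe: Adam, a weighted sum of task-specific losses, dropout of 0.2, and early stopping with a patience of 15 epochs. Below is a minimal PyTorch sketch of that loop, assuming a model that returns one output per task; `criteria` and `loss_weights` are hypothetical per-task loss functions and weights, and the batch size (128) and dropout rate (0.2) would be set in the DataLoader and the model definition.

```python
import torch

def train_1dcnn(model, train_loader, val_loader, criteria, loss_weights,
                max_epochs=500, patience=15, lr=1e-3):
    """Sketch: Adam, weighted multi-task loss, early stopping on the validation loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, stale_epochs = float("inf"), 0

    def multitask_loss(outputs, targets):
        # Sum of (task weight x task-specific loss) over all tasks.
        return sum(w * crit(o, t) for w, crit, o, t
                   in zip(loss_weights, criteria, outputs, targets))

    for epoch in range(max_epochs):
        model.train()
        for inputs, targets in train_loader:        # targets: one tensor per task
            optimizer.zero_grad()
            loss = multitask_loss(model(inputs), targets)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val = sum(multitask_loss(model(inputs), targets).item()
                      for inputs, targets in val_loader) / max(len(val_loader), 1)

        if val < best_val:
            best_val, stale_epochs = val, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:            # early stopping
                break
    return model
```
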
“…To alleviate these difficulties, Lin et al [41] propose the compressive MHE (CoMHE) as a more effective regularization to minimize hyperspherical energy for neural networks. Following [18], [41], Perez-Lapillo et al [42] and Shah et al [43] improve voice separation by applying MHE to Wave-U-Net and time-frequency domain networks, respectively. MHE has wide applications in image recognition [39], [44], [45], face recognition [36], [18], [46], speaker verification [47], adversarial robustness [48], few-shot learning [49], [50], etc.…”
Section: Related Work
mentioning, confidence: 99%