Denoising Auto-Encoder with Recurrent Skip Connections and Residual Regression for Music Source Separation

Liu, Jen-Yu; Yang, Yi-Hsuan

doi:10.1109/icmla.2018.00123

Cited by 37 publications

(29 citation statements)

References 19 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Second, unlike prior arts (including [16]), we investigate one additional way to employ SS to improve SID. Given the separated vocal tracks and instrumental tracks of the audio recordings in the training set, we perform the so-called "data augmentation" [19][20][21][22] by randomly shuffling the separated tracks of different songs and then remixing them. For example, we remix the vocal part of a song from a singer with the instrumental part of another song from a different singer.…”

Section: Conv Blockmentioning

confidence: 99%

See 1 more Smart Citation

Addressing The Confounds Of Accompaniments In Singer Identification

Hsieh

Cheng

Fan

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

View full text Add to dashboard Cite

Identifying singers is an important task with many applications. However, the task remains challenging due to many issues. One major issue is related to the confounding factors from the background instrumental music that is mixed with the vocals in music production. A singer identification model may learn to extract non-vocal related features from the instrumental part of the songs, if a singer only sings in certain musical contexts (e.g., genres). The model cannot therefore generalize well when the singer sings in unseen contexts. In this paper, we attempt to address this issue. Specifically, we employ open-unmix, an open source tool with state-of-the-art performance in source separation, to separate the vocal and instrumental tracks of music. We then investigate two means to train a singer identification model: by learning from the separated vocal only, or from an augmented set of data where we "shuffle-and-remix" the separated vocal tracks and instrumental tracks of different songs to artificially make the singers sing in different contexts. We also incorporate melodic features learned from the vocal melody contour for better performance. Evaluation results on a benchmark dataset called the artist20 shows that this data augmentation method greatly improves the accuracy of singer identification.

show abstract

Section: Conv Blockmentioning

confidence: 99%

“…This technique has been popular for some time among the machine learning community. It has also been shown beneficial for MIR tasks such as singing voice detection and source separation [20][21][22] (but not yet for SID).…”

Section: Data Augmentation: Separate Shuffle and Remixmentioning

confidence: 99%

Addressing The Confounds Of Accompaniments In Singer Identification

Hsieh

Cheng

Fan

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

View full text Add to dashboard Cite

show abstract

“…Mean square error is used as the loss function for updating the network, and Adam [Kingma and Ba, 2015] is used to update the weights. As data augmentation has been found useful in the literature [Takahashi et al, 2018;Uhlich et al, 2017;Liu and Yang, 2018], we use data augmentation in the training process by randomly shuffling the audio clips in each source and then collecting the audio clips from the four sources in the shuffled orders. A mixture clip is formed by summing the collected source clips.…”

Section: Evaluation Setupmentioning

confidence: 99%

“…The phases of the mixture complex spectrograms are used with the predicted spectrogram magnitudes to construct the complex spectrogram. Before converting back to waveforms, multi-channel Wiener filter is applied to the complex spectrograms as widely done in recent source separation systems [Nugraha et al, 2016;Uhlich et al, 2017;Takahashi et al, 2018;Liu and Yang, 2018]. Table 2 shows the performance of the proposed models and the top-performing models of SiSEC2018 [Rafii et al, 2018].…”

Section: Evaluation Setupmentioning

confidence: 99%

Dilated Convolution with Dilated GRU for Music Source Separation

Liu

Yang

2019

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence

Self Cite

View full text Add to dashboard Cite

Stacked dilated convolutions used in Wavenet have been shown effective for generating high-quality audios. By replacing pooling/striding with dilation in convolution layers, they can preserve highresolution information and still reach distant locations. Producing high-resolution predictions is also crucial in music source separation, whose goal is to separate different sound sources while maintaining the quality of the separated sounds. Therefore, this paper investigates using stacked dilated convolutions as the backbone for music source separation. However, while stacked dilated convolutions can reach wider context than standard convolutions, their effective receptive fields are still fixed and may not be wide enough for complex music audio signals. To reach information at remote locations, we propose to combine dilated convolution with a modified version of gated recurrent units (GRU) called the 'Dilated GRU' to form a block. A Dilated GRU unit receives information from k steps before instead of the previous step for a fixed k. This modification allows a GRU unit to reach a location with fewer recurrent steps and run faster because it can execute partially in parallel. We show that the proposed model with a stack of such blocks performs equally well or better than the state-ofthe-art models for separating vocals and accompaniments.

show abstract

“…There are generally three basic structures to construct DNNs: Feed-Forward Network (FFN) [4], Recurrent Neural Network (RNN) [5], and Convolutional Neural Network (CNN) [6] [7]. Recently the RNN and CNN have been combined to improve the MSS [8], [9].…”

Section: Introductionmentioning

confidence: 99%

A Skip Attention Mechanism for Monaural Singing Voice Separation

Yuan

Wang

et al. 2019

IEEE Signal Process. Lett.

View full text Add to dashboard Cite

This work proposes a simple but effective attention mechanism, namely Skip Attention (SA), for monaural singing voice separation (MSVS). First, the SA, embedded in the convolutional encoder-decoder network (CEDN), realizes an attentiondriven and dependency modeling for the repetitive structures of the music source. Second, the SA, replacing the popular skip connection in the CEDN, effectively controls the flow of the lowlevel (vocal and musical) features to the output and improves the feature sensitivity and accuracy for MSVS. Finally, we implement the proposed SA on the Stacked Hourglass Network (SHN), namely Skip Attention SHN (SA-SHN). Quantitative and qualitative evaluation results have shown that the proposed SA-SHN achieves significant performance improvement on the MIR-1K dataset (compared to the state-of-the-art SHN) and competitive MSVS performance on the DSD100 dataset (compared to the state-of-the-art DenseNet), even without using any data augmentation methods.

show abstract

Denoising Auto-Encoder with Recurrent Skip Connections and Residual Regression for Music Source Separation

Cited by 37 publications

References 19 publications

Addressing The Confounds Of Accompaniments In Singer Identification

Addressing The Confounds Of Accompaniments In Singer Identification

Dilated Convolution with Dilated GRU for Music Source Separation

A Skip Attention Mechanism for Monaural Singing Voice Separation

Contact Info

Product

Resources

About