2018
DOI: 10.48550/arxiv.1806.03185
Preprint

Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

Abstract: Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyperparameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high-quality separation res…
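To make the multi-scale, time-domain idea in the abstract concrete, here is a minimal sketch of a Wave-U-Net-style 1-D U-Net, assuming PyTorch. The depth, channel counts, kernel sizes, and the simple decimation/interpolation resampling are illustrative choices, not the authors' exact configuration.

```python
# Minimal Wave-U-Net-style sketch: 1-D convolutions with decimation on the
# way down, linear-interpolation upsampling with skip connections on the
# way up, operating directly on waveform samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveUNetSketch(nn.Module):
    def __init__(self, in_ch=1, num_sources=2, base=24, depth=4):
        super().__init__()
        self.down = nn.ModuleList()
        ch = in_ch
        for i in range(depth):
            out = base * (i + 1)
            self.down.append(nn.Conv1d(ch, out, kernel_size=15, padding=7))
            ch = out
        self.bottleneck = nn.Conv1d(ch, ch, kernel_size=15, padding=7)
        self.up = nn.ModuleList()
        for i in reversed(range(depth)):
            out = base * (i + 1)
            # input is the upsampled features concatenated with the skip
            self.up.append(nn.Conv1d(ch + out, out, kernel_size=5, padding=2))
            ch = out
        self.out = nn.Conv1d(ch, num_sources, kernel_size=1)

    def forward(self, x):  # x: (batch, in_ch, time), time divisible by 2**depth
        skips = []
        for conv in self.down:
            x = torch.relu(conv(x))
            skips.append(x)
            x = x[:, :, ::2]  # decimate by 2: each level sees a coarser scale
        x = torch.relu(self.bottleneck(x))
        for conv in self.up:
            x = F.interpolate(x, scale_factor=2, mode='linear',
                              align_corners=False)
            x = torch.cat([x, skips.pop()], dim=1)
            x = torch.relu(conv(x))
        return torch.tanh(self.out(x))  # one waveform per estimated source

# e.g. WaveUNetSketch()(torch.randn(1, 1, 16384)) -> shape (1, 2, 16384)
```

Because every layer works on raw samples, the model can in principle exploit phase, and the repeated decimation is what buys the long temporal input context the abstract says sample-level models need.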

Cited by 87 publications (135 citation statements)
References 17 publications

“…As presented in Table 2, our proposed methods are compared with other state-of-the-art (SOTA) SE approaches under standard paired data, including GAN-based methods (e.g., SEGAN [6], MM-SEGAN [9], RSGAN [7], MetricGAN [8], SASEGAN [12] and CRGAN [11]) and non-GAN-based methods (i.e., Wave-U-Net [28], DFL-SE [29], CRN-MSE [30], GCRN [31], DCCRN [32] and TF-SNN [33]). Note that we reimplement GCRN and DCCRN on the VoiceBank+DEMAND dataset, and directly use the reported scores of the other methods from their original papers.…”
Section: Comparison With Other Competitive Methods Under Standard Paired Data
confidence: 99%
“…As a data-driven supervised learning approach, DNN-based speech enhancement can be mainly categorized into time-frequency domain [2][3][4] and time domain [5][6][7] methods. The time-frequency (T-F) domain methods aim to extract the acoustic features (e.g., complex spectrum or logarithmic power spectrum) of clean speech from the features of noisy speech.…”
Section: Introduction
confidence: 99%
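The T-F front-end this quote contrasts with Wave-U-Net's time-domain approach is easy to make concrete: an STFT yields the complex spectrum, from which a log-power feature is derived. A minimal sketch, assuming PyTorch; the frame and hop sizes are illustrative.

```python
# Compute the complex spectrum and the log-power spectrum of a waveform.
# The complex spectrum is kept so the phase can be reused at resynthesis,
# which is exactly the information magnitude-only models discard.
import torch

def tf_features(wave, n_fft=512, hop=128):
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)  # (freq, frames)
    log_power = torch.log(spec.abs() ** 2 + 1e-8)          # input feature
    return spec, log_power

wave = torch.randn(16000)         # 1 s of audio at 16 kHz
spec, logp = tf_features(wave)
print(spec.shape, logp.shape)     # torch.Size([257, 126]) for both
```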
“…As a typical method in the time domain, Conv-TasNet [5] utilizes a 1-D convolutional neural network (Conv-1D) [14] as an encoder to convert the time-domain waveform into representations suitable for clean-speech estimation, and then converts the representations back to a waveform with a transposed convolutional layer called the decoder. Time-domain methods suffer from the difficulty of modelling extremely long sequences, so very deep convolutional architectures such as Wave-U-Net [7] have to be utilized for feature compression. Conventional recurrent neural networks (RNNs) are also not effective for modelling such long sequences.…”
Section: Introduction
confidence: 99%
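The encoder/decoder pair this quote describes can be sketched in a few lines, assuming PyTorch. The filter count, kernel size, and stride below are illustrative, not Conv-TasNet's exact hyperparameters.

```python
# Conv-TasNet-style front-end sketch: a strided 1-D convolution maps the
# waveform to a learned representation; a transposed convolution with the
# same kernel and stride maps the representation back to a waveform.
import torch
import torch.nn as nn

kernel, stride, filters = 16, 8, 512
encoder = nn.Conv1d(1, filters, kernel_size=kernel, stride=stride, bias=False)
decoder = nn.ConvTranspose1d(filters, 1, kernel_size=kernel, stride=stride,
                             bias=False)

wave = torch.randn(1, 1, 16000)      # (batch, channel, samples)
rep = torch.relu(encoder(wave))      # (1, 512, 1999): learned basis activations
recon = decoder(rep)                 # (1, 1, 16000): back to a waveform
print(rep.shape, recon.shape)
```

A separation network would operate on `rep` (e.g., estimating one mask per source) before decoding; the learned filterbank plays the role the fixed STFT plays in T-F methods.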
“…Compared with a single-channel audio signal, a multi-channel audio signal carries more spatial information, which further assists speech separation. Wave-U-Net [2] concatenates the multi-channel signals as input to a U-Net, which requires changing the number of input channels; moreover, the time-domain input length is usually not fixed, and very long sequences resist optimization, so traditional RNN models cannot be used effectively. Dual-path recurrent neural networks (DPRNN) restructure the RNN within the deep model to process extremely long speech sequences [3].…”
Section: Introduction
confidence: 99%
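The dual-path trick the quote attributes to DPRNN [3] is to fold a long sequence into short chunks and alternate an intra-chunk RNN (within each chunk) with an inter-chunk RNN (across chunks), so no RNN ever unrolls over the full sequence. A minimal sketch, assuming PyTorch; all sizes are illustrative and the sequence length is assumed divisible by the chunk size.

```python
# Dual-path sketch: one intra-chunk and one inter-chunk LSTM pass with
# residual connections, over a sequence folded into (n_chunks, chunk).
import torch
import torch.nn as nn

feat, chunk = 64, 100
intra = nn.LSTM(feat, feat, batch_first=True)
inter = nn.LSTM(feat, feat, batch_first=True)

x = torch.randn(1, 10000, feat)                # very long sequence
b, t, f = x.shape
x = x.reshape(b, t // chunk, chunk, f)         # (batch, n_chunks, chunk, feat)

# Intra-chunk pass: each chunk is processed independently (length = chunk).
y, _ = intra(x.reshape(-1, chunk, f))
x = x + y.reshape(b, -1, chunk, f)             # residual connection

# Inter-chunk pass: the RNN runs across chunks at each within-chunk position,
# so its sequence length is n_chunks, not the full signal length.
z = x.transpose(1, 2).reshape(-1, t // chunk, f)
y, _ = inter(z)
x = x + y.reshape(b, chunk, -1, f).transpose(1, 2)
print(x.shape)                                 # torch.Size([1, 100, 100, 64])
```

Each LSTM here sees sequences of length 100 rather than 10,000, which is why the dual path sidesteps the long-sequence optimization problem the quote raises for traditional RNNs.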