Interspeech 2019
DOI: 10.21437/interspeech.2019-1292
End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network

Cited by 17 publications (14 citation statements). References 13 publications.
“…By default, we use the scale-invariant signal-to-noise ratio (SI-SNR) [26] with permutation-invariant training [1] as our objective function. SI-SNR is a widely used objective function for end-to-end speech source separation [8,10,11,15]. In certain experiments (see Section 4.2), in an effort to constrain the scale of the predicted sources from the deep encoder/decoder, we augment the objective function with a power-law term that encourages the model to predict spectra that are of similar magnitude to the ground truth.…”
Section: Objective Functions
confidence: 99%
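The quoted passage describes training with scale-invariant signal-to-noise ratio (SI-SNR) under permutation-invariant training (PIT). A minimal sketch of that objective is shown below; the function names and the brute-force search over source orderings are illustrative assumptions, not the cited papers' actual implementations (which typically operate on batched tensors inside an autodiff framework).

```python
import numpy as np
from itertools import permutations

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference waveform."""
    est = est - est.mean()                 # remove DC offset
    ref = ref - ref.mean()
    # Project the estimate onto the reference: the scaling makes the metric
    # invariant to the overall gain of the estimate.
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))

def pit_si_snr(ests, refs):
    """Permutation-invariant SI-SNR: best mean score over source orderings."""
    n = len(refs)
    return max(
        np.mean([si_snr(ests[p], refs[i]) for i, p in enumerate(perm)])
        for perm in permutations(range(n))
    )
```

During training the negative of `pit_si_snr` would serve as the loss; the exhaustive permutation search is tractable because the number of sources is small (typically two or three).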
“…On the other hand, Yang et al. [15] is another recent work that enhances the separator by utilizing an embedding network and clustering. Their separator also takes advantage of STFT features, whereas our model learns only from waveforms. When comparing their results with ours, we note that by simply increasing the depth of the encoder/decoder one can achieve a similar improvement to Yang et al. [15].…”
Section: Deep Encoder/Decoder vs. Enhanced Separators
confidence: 99%
“…This strategy imposes an upper limit on the separation performance. To overcome this problem, a time-domain approach was proposed in [14], which directly models the mixture waveform using an encoder-decoder framework and has made great progress in recent years [15,16,17,18,19,20,21].…”
Section: Introduction
confidence: 99%
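The encoder-decoder framework the quote refers to replaces the STFT with learned analysis and synthesis bases and applies per-source masks to the mixture representation. The following is a toy numpy sketch of that pipeline; every dimension, the random "weights", and the random masks are assumptions for illustration only, not the architecture of any cited model (where the bases and masks are learned end-to-end).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 16-sample window, hop 8, 64 basis filters, 2 sources.
win, hop, n_basis, n_src = 16, 8, 64, 2

# Random matrices stand in for trained encoder/decoder weights.
enc = rng.standard_normal((n_basis, win)) * 0.1   # analysis (encoder) filters
dec = rng.standard_normal((win, n_basis)) * 0.1   # synthesis (decoder) filters

def encode(x):
    """Frame the waveform and project each frame onto the encoder basis."""
    frames = np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, hop)])
    return np.maximum(frames @ enc.T, 0)          # ReLU keeps the representation non-negative

def decode(rep):
    """Map basis coefficients back to frames and overlap-add into a waveform."""
    out = np.zeros(hop * (len(rep) - 1) + win)
    for i, frame in enumerate(rep @ dec.T):
        out[i * hop:i * hop + win] += frame
    return out

mixture = rng.standard_normal(1024)
rep = encode(mixture)                              # shape: (num_frames, n_basis)

# A separator network would predict these masks; random ones illustrate the shapes.
masks = rng.random((n_src, *rep.shape))
masks /= masks.sum(axis=0, keepdims=True)          # masks sum to 1 across sources
sources = [decode(m * rep) for m in masks]         # one estimated waveform per source
```

The point of the sketch is the data flow: waveform in, masked latent representation, waveform out, with no STFT anywhere, so a time-domain loss such as SI-SNR can be applied directly to `sources`.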
“…These structural limitations are circumvented by end-to-end speech separation systems such as TasNet [6], Conv-TasNet [7] or FurcaNext [8,9]. These systems introduce several changes to the STFT-magnitude-based approaches: first, the training loss is defined in the time domain instead of the STFT domain.…”
Section: Introduction
confidence: 99%