2020
DOI: 10.1016/j.dsp.2020.102731

A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN

Cited by 16 publications (15 citation statements)
References 20 publications

“…Originally, the DNN-predicted TF masks were directly applied to the STFT of the noisy signal in order to extract the target speech [29], [30]. This idea continues to be used with a good performance both in the single-channel [31] and the multichannel context [32], but it requires much better TF masks and complex DNN architectures. It also suffers from distortion that can be alleviated by using multichannel filters.…”
Section: Mask-based Multichannel Speech Enhancement
Citation type: mentioning (confidence: 99%)
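A minimal sketch of the mask-application step described in this statement, assuming an oracle IRM as a stand-in for a DNN-predicted mask (the signals, frame size, and mask choice here are illustrative, not taken from the cited works):

```python
# Sketch: apply a T-F mask to the STFT of a noisy signal and resynthesize.
# An oracle IRM replaces the DNN-predicted mask so the example is self-contained.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)        # placeholder for a clean utterance
noise = 0.5 * rng.standard_normal(fs)  # placeholder for additive noise
noisy = clean + noise

nperseg = 512
_, _, S_clean = stft(clean, fs, nperseg=nperseg)
_, _, S_noise = stft(noise, fs, nperseg=nperseg)
_, _, S_noisy = stft(noisy, fs, nperseg=nperseg)

# Ideal ratio mask: per-bin gain in [0, 1] from the clean/noise power ratio.
irm = np.sqrt(np.abs(S_clean) ** 2 /
              (np.abs(S_clean) ** 2 + np.abs(S_noise) ** 2 + 1e-8))

# Direct mask application to the noisy STFT, then inverse STFT.
S_enhanced = irm * S_noisy
_, enhanced = istft(S_enhanced, fs, nperseg=nperseg)
```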
“…The TCNN, utilizing causal and dilated kernels, demonstrated substantial performance improvements compared to the above networks for temporal modeling tasks. Inspired by the success of the TCNN, in [20], Li et al proposed a stacked and temporal convolutional neural network (STCNN) to jointly implement the spectrum mapping and T-F masking tasks. The STCNN benefited immensely from the feature extraction ability of the stacking CNNs (SCNNs) and the temporal modeling ability of the TCNN, and, as a result, has a state-of-the-art performance in the MOL-based speech enhancement field.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
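To make the "causal and dilated kernels" concrete, the following PyTorch block is a rough illustration of TCNN-style temporal convolution, not the authors' STCNN; the channel count, kernel size, and dilation schedule are arbitrary placeholders:

```python
# Sketch: a causal, dilated 1-D convolution block for temporal modeling.
import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        # Left-pad so the output at time t depends only on inputs up to t.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.PReLU()

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # pad on the left only (causal)
        return self.act(self.conv(x))

# Exponentially growing dilation enlarges the receptive field with few layers.
blocks = nn.Sequential(*[CausalDilatedConv1d(64, dilation=2 ** d) for d in range(4)])
out = blocks(torch.randn(1, 64, 100))            # shape preserved: (1, 64, 100)
```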
“…The former method directly processes the speech waveform in the time domain [1][2][3], while the latter method employs short-time Fourier transform (STFT) to convert the time-domain speech into a frequency spectrum represented in the time-frequency domain and processes the spectrum as input features [4][5][6]. The latter approach can enhance the noisy speech by predicting masks (e.g., IBM [7], IRM [8], CRM [9], and PSM [10]) required for speech enhancement. These masks come in different types and serve various purposes.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
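The masks named in this statement can be written down directly from oracle spectra; the definitions below follow common usage in the literature (assumed, not copied from the cited papers) for the IBM, IRM, and PSM:

```python
# Sketch: oracle definitions of three common T-F masks.
import numpy as np

def ibm(S_clean, S_noise, lc_db=0.0):
    """Ideal binary mask: 1 where local SNR exceeds the criterion, else 0."""
    snr_db = 20 * np.log10((np.abs(S_clean) + 1e-8) / (np.abs(S_noise) + 1e-8))
    return (snr_db > lc_db).astype(np.float32)

def irm(S_clean, S_noise):
    """Ideal ratio mask: soft per-bin gain from the clean/noise power ratio."""
    p_c, p_n = np.abs(S_clean) ** 2, np.abs(S_noise) ** 2
    return np.sqrt(p_c / (p_c + p_n + 1e-8))

def psm(S_clean, S_noisy):
    """Phase-sensitive mask: magnitude ratio weighted by the cosine of the
    clean-noisy phase difference, clipped to [0, 1]."""
    theta = np.angle(S_clean) - np.angle(S_noisy)
    m = (np.abs(S_clean) / (np.abs(S_noisy) + 1e-8)) * np.cos(theta)
    return np.clip(m, 0.0, 1.0)
```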