A Simultaneous Denoising and Dereverberation Framework with Target Decoupling

Li, Andong; Liu, Wenzhe; Luo, Xiaoxue; Yu, Guochen; Zheng, Chengshi; Li, Xiaodong

doi:10.21437/interspeech.2021-1137

Cited by 55 publications

(21 citation statements)

References 0 publications


“…The 320-point STFT is utilized and 161-dimension spectral features can be obtained. Due to the efficacy of the compressed spectrum in dereverberation and denoising task [15], [36], we conduct the power compression toward the magnitude while remaining the phase unaltered, and the optimal compression coefficient is set to 0.5, i.e., $\mathrm{Cat}\left(|X|^{0.5}\cos(\theta_X),\ |X|^{0.5}\sin(\theta_X)\right)$ as input, $\mathrm{Cat}\left(|S|^{0.5}\cos(\theta_S),\ |S|^{0.5}\sin(\theta_S)\right)$ as target. All the models are optimized using Adam [37] with the learning rate of 8e-4.…”

Section: B Implementation Setup (mentioning)

confidence: 99%
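The power compression described in the quoted statement above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the citing authors' code: the function names `compress_spectrum` and `decompress_spectrum` and the three-bin example frame are hypothetical, and a real pipeline would apply the same operation to every 161-bin STFT frame.

```python
import cmath
import math


def compress_spectrum(spectrum, alpha=0.5):
    """Raise each magnitude to the power alpha and leave the phase unchanged.

    Returns the compressed real parts followed by the compressed imaginary
    parts, mirroring the Cat(|X|^a cos(theta), |X|^a sin(theta)) layout.
    """
    real_parts, imag_parts = [], []
    for value in spectrum:
        magnitude, phase = cmath.polar(value)
        compressed = magnitude ** alpha
        real_parts.append(compressed * math.cos(phase))
        imag_parts.append(compressed * math.sin(phase))
    return real_parts + imag_parts


def decompress_spectrum(features, alpha=0.5):
    """Invert compress_spectrum: restore the original magnitudes, keep phases."""
    half = len(features) // 2
    spectrum = []
    for re, im in zip(features[:half], features[half:]):
        magnitude, phase = cmath.polar(complex(re, im))
        spectrum.append(cmath.rect(magnitude ** (1.0 / alpha), phase))
    return spectrum


# A 320-point STFT of a real signal yields 320 // 2 + 1 one-sided bins.
n_fft = 320
n_bins = n_fft // 2 + 1

# Hypothetical three-bin frame, used only to keep the example short.
frame = [complex(4.0, 0.0), complex(0.0, 9.0), complex(-3.0, -4.0)]
features = compress_spectrum(frame)
restored = decompress_spectrum(features)
```

Because only the magnitude is compressed, the mapping is invertible: raising the compressed magnitude to the power 1/0.5 = 2 recovers the original spectrum.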

“…For example, in [14], real-valued convolutional recurrent networks (CRN) were leveraged to directly map the RI components of target speech, where the enhanced RI components were decoded by two decoders respectively. More recently, a handful of multi-stage decoupling-style methods have thrived in the SE area and were demonstrated to achieve a remarkable performance [10], [15], [16]. Instead of packing the mapping process into only one black box in the previous single-stage paradigm, these multi-stage methods decoupled the original complex spectrum estimation into optimizing magnitude and phase stage by stage, and alleviated the implicit compensation effect between two targets [17].…”

Section: Introduction (mentioning)

confidence: 99%


DBT-Net: Dual-branch federative magnitude and phase estimation with attention-in-attention transformer for monaural speech enhancement

Yu, Li, Wang, et al. 2022. Preprint. Self-citation.


The decoupling-style concept begins to ignite in the speech enhancement area, which decouples the original complex spectrum estimation task into multiple easier sub-tasks (i.e., magnitude and phase), resulting in better performance and easier interpretability. In this paper, we propose a dual-branch federative magnitude and phase estimation framework, dubbed DBT-Net, for monaural speech enhancement, which aims at recovering the coarse- and fine-grained regions of the overall spectrum in parallel. From the complementary perspective, the magnitude estimation branch is designed to filter out dominant noise components in the magnitude domain, while the complex spectrum purification branch is elaborately designed to inpaint the missing spectral details and implicitly estimate the phase information in the complex domain. To facilitate the information flow between each branch, interaction modules are introduced to leverage features learned from one branch, so as to suppress the undesired parts and recover the missing components of the other branch. Instead of adopting the conventional RNNs and temporal convolutional networks for sequence modeling, we propose a novel attention-in-attention transformer-based network within each branch for better feature learning. More specifically, it is composed of several adaptive spectro-temporal attention transformer-based modules and an adaptive hierarchical attention module, aiming to capture long-term time-frequency dependencies and further aggregate intermediate hierarchical contextual information. Comprehensive evaluations on the WSJ0-SI84 + DNS-Challenge and VoiceBank + DEMAND datasets demonstrate that the proposed approach consistently outperforms previous advanced systems and yields state-of-the-art performance in terms of speech quality and intelligibility.

“…DNS-MOS is used to do model training and model selection during noise suppression development. DNSMOS is also used for doing ablation studies for noise suppressors [22,23]. DNS-MOS has been quite popular, with over a hundred researchers using it after several months of releasing it.…”

Section: Related Work (mentioning)

confidence: 99%

DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors

Reddy, Gopal, Cutler. 2021. Preprint.


Human subjective evaluation is the "gold standard" to evaluate speech quality optimized for human perception. Perceptual objective metrics serve as a proxy for subjective scores. We have recently developed a non-intrusive speech quality metric called Deep Noise Suppression Mean Opinion Score (DNSMOS) using the scores from ITU-T Rec. P.808 [1] subjective evaluation. The P.808 scores reflect the overall quality of the audio clip. ITU-T Rec. P.835 [2] subjective evaluation framework gives the standalone quality scores of speech and background noise in addition to the overall quality. In this work, we train an objective metric based on P.835 human ratings that outputs 3 scores: i) speech quality (SIG), ii) background noise quality (BAK), and iii) the overall quality (OVRL) of the audio. The developed metric is highly correlated with human ratings, with a Pearson's Correlation Coefficient (PCC)=0.94 for SIG and PCC=0.98 for BAK and OVRL. This is the first non-intrusive P.835 predictor we are aware of. DNSMOS P.835 is made publicly available as an Azure service.
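The Pearson's Correlation Coefficient reported in the abstract above measures how linearly a predicted score tracks the human rating. The following is a minimal standard-library sketch of that statistic, not the DNSMOS evaluation code; the function name `pearson_correlation` and the two score lists are hypothetical values chosen for illustration only.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical ratings: human MOS per clip versus a model's predicted MOS.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_scores = [1.2, 1.9, 3.2, 3.8, 5.1]
pcc = pearson_correlation(human_scores, predicted_scores)
```

A value near 1 indicates that the predictor ranks and spaces clips much as human listeners do, which is the sense in which the abstract calls the metric highly correlated with human ratings.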


“…However, most of the previous studies on speech enhancement are for narrow-band (8 kHz) or wide-band (16 kHz) audio, and there are few methods for 48 kHz full-band audio. Deep learning-based speech enhancement methods [1,2,3] have achieved impressive performance on wide-band audio, but the lack of sufficient training data has become a major limitation for full-band deep learning speech enhancement methods. The recent 4th Microsoft Deep Noise Suppression (DNS-4) Challenge extends efforts to full-band single-channel speech enhancement tasks with a massive training dataset and real-scenario test set.…”

Section: Introduction (mentioning)

confidence: 99%

FB-MSTCN: A Full-Band Single-Channel Speech Enhancement Method Based on Multi-Scale Temporal Convolutional Network

Zhang, Zhang, Zhuang, et al. 2022. Preprint.


In recent years, deep learning-based approaches have significantly improved the performance of single-channel speech enhancement. However, due to the limitation of training data and computational complexity, real-time enhancement of full-band (48 kHz) speech signals is still very challenging. Because of the low energy of spectral information in the high-frequency part, it is more difficult to directly model and enhance the full-band spectrum using neural networks. To solve this problem, this paper proposes a two-stage real-time speech enhancement model with extraction-interpolation mechanism for a full-band signal. The 48 kHz full-band time-domain signal is divided into three sub-channels by extracting, and a two-stage processing scheme of 'masking + compensation' is proposed to enhance the signal in the complex domain. After the two-stage enhancement, the enhanced full-band speech signal is restored by interval interpolation. In the subjective listening and word accuracy test, our proposed model achieves superior performance and outperforms the baseline model overall by 0.59 MOS and 4.0% WAcc for the non-personalized speech denoising task.
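One plausible reading of the extraction-interpolation mechanism in the abstract above is an interleaved split: every third sample of the 48 kHz signal goes to the same sub-channel, giving three 16 kHz streams that are re-interleaved after enhancement. The sketch below illustrates only that splitting and restoring step under this assumption; the abstract does not spell out the exact scheme, the function names are hypothetical, and the two-stage enhancement itself is omitted.

```python
def extract_subchannels(signal, factor=3):
    """Split a signal into `factor` sub-channels by taking every factor-th sample."""
    return [signal[offset::factor] for offset in range(factor)]


def interleave_subchannels(subchannels):
    """Restore the full-rate signal by interleaving the sub-channel samples."""
    factor = len(subchannels)
    total = sum(len(channel) for channel in subchannels)
    restored = [0.0] * total
    for offset, channel in enumerate(subchannels):
        # Extended slice assignment puts each sub-channel back at its offset.
        restored[offset::factor] = channel
    return restored


# Hypothetical ten-sample signal standing in for a 48 kHz waveform.
signal = [float(n) for n in range(10)]
subchannels = extract_subchannels(signal)
restored = interleave_subchannels(subchannels)
```

With no processing between the two calls, the round trip is lossless, so any change in the restored signal comes from the enhancement applied to the sub-channels.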
