Multi-Microphone Complex Spectral Mapping for Speech Dereverberation

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

2021

Self Cite

Speech quality and intelligibility can be severely degraded by background noise in mobile communication. In order to attenuate background noise, speech enhancement systems have been integrated into mobile phones, and a microphone array is typically deployed to improve the enhancement performance. This paper proposes a novel approach to real-time speech enhancement for dual-microphone mobile phones. Our approach employs a causal densely-connected convolutional recurrent network to perform dual-channel complex spectral mapping. We apply a structured pruning technique for compressing the model without significantly affecting the enhancement performance. This leads to a real-time enhancement system for on-device processing. Evaluation results show that the proposed approach substantially advances the performance of an earlier approach to dual-channel speech enhancement for mobile communication.
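As a rough illustration of the dual-channel complex spectral mapping described in this abstract, the sketch below shows only the input and output representation: real and imaginary (RI) parts of both microphone spectrograms are stacked as features, and a predicted RI pair is turned back into a complex spectrogram. The function names, array shapes, and the stand-in "network" (which simply copies the first microphone) are assumptions for illustration, not the authors' model.

```python
import numpy as np

def stack_ri(stft_channels: np.ndarray) -> np.ndarray:
    """Stack real and imaginary parts of each microphone's STFT.

    stft_channels: complex array of shape (mics, frames, freqs).
    Returns a real array of shape (2 * mics, frames, freqs):
    all real parts first, then all imaginary parts.
    """
    return np.concatenate([stft_channels.real, stft_channels.imag], axis=0)

def ri_to_complex(ri: np.ndarray) -> np.ndarray:
    """Convert a predicted (2, frames, freqs) RI pair to a complex spectrogram."""
    return ri[0] + 1j * ri[1]

# Hypothetical two-microphone mixture: 5 frames, 9 frequency bins.
rng = np.random.default_rng(0)
mixture = rng.standard_normal((2, 5, 9)) + 1j * rng.standard_normal((2, 5, 9))

features = stack_ri(mixture)  # network input, shape (4, 5, 9)

# Stand-in for a trained network (assumption): pass the first microphone through.
predicted_ri = np.stack([features[0], features[2]])
estimate = ri_to_complex(predicted_ri)  # complex, shape (5, 9)
```

In a real system the stand-in would be replaced by the trained network, and the estimate would be inverted to a waveform with an inverse STFT.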

Section: Results (supporting)

confidence: 65%

Section: Dual-channel Complex Spectral Mapping (mentioning)

confidence: 99%


Real-Time Speech Enhancement for Mobile Communication Based on Dual-Channel Complex Spectral Mapping

Tan

Zhang

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

2021

Self Cite

“…In this context, a block-online approach is proposed in [13]-[15] to address continuous speaker separation, where speech signals from an unknown number of speakers, degraded by environmental noise, room reverberation and a wide range of speaker overlap, arrive as a continuous stream. These studies assume that in each fixed-length short processing block, typically 2.4-second long, there are at most two speakers talking, so that a two-speaker separation model based on for example utterance-wise PIT (uPIT) can be applied in each block for separation.…”

Section: Introduction (mentioning)

confidence: 99%

Count And Separate: Incorporating Speaker Counting For Continuous Speaker Separation

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

2021

Self Cite

This study leverages frame-wise speaker counting to switch between speech enhancement and speaker separation for continuous speaker separation. The proposed approach counts the number of speakers at each frame. If there is no speaker overlap, a speech enhancement model is used to suppress noise and reverberation. Otherwise, a speaker separation model based on permutation invariant training is utilized to separate multiple speakers in noisy-reverberant conditions. We stitch the results from the enhancement and separation models based on their predictions in a small augmented window of frames surrounding an overlapped segment. Assuming a fixed array geometry between training and testing, we use multi-microphone complex spectral mapping for enhancement and separation, where deep neural networks are trained to predict the real and imaginary (RI) components of direct sound from stacked reverberant-noisy RI components of multiple microphones. Experimental results on the LibriCSS dataset demonstrate the effectiveness of our approach.
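The frame-wise switching idea in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions: the per-frame speaker counts, the enhancement output, and the two separation outputs are taken as given arrays, the function name `switch_by_count` is hypothetical, and the stitching over an augmented window around overlapped segments is omitted.

```python
import numpy as np

def switch_by_count(counts, enhanced, separated):
    """Choose per frame between enhancement and separation outputs.

    counts:    (frames,) estimated number of active speakers per frame.
    enhanced:  (frames, freqs) output of the enhancement model.
    separated: (2, frames, freqs) outputs of a two-speaker separation model.
    Returns a (2, frames, freqs) array with two output streams.
    """
    counts = np.asarray(counts)
    overlapped = counts >= 2
    out = np.zeros_like(separated)
    # Frames without overlap: enhancement output goes to stream 0, stream 1 stays silent.
    out[0, ~overlapped] = enhanced[~overlapped]
    # Overlapped frames: use both separation outputs.
    out[:, overlapped] = separated[:, overlapped]
    return out
```

A fuller version would also have to decide which separation stream continues the single-speaker stream at segment boundaries, which is what the stitching step in the abstract addresses.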

“…With fewer parameters, our best model outperforms the conformer in all scenarios for both utterance-wise and continuous evaluation. We should mention that a very recently posted paper [24] reports state-of-the-art results for the LibriCSS evaluation. This study uses complex spectral mapping to train the separation model.…”

Section: Evaluation Results (mentioning)

confidence: 99%

Time-Domain Loss Modulation Based on Overlap Ratio for Monaural Conversational Speaker Separation

Taherian

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

2021

Self Cite

Existing speaker separation methods deliver excellent performance on fully overlapped signal mixtures. To apply these methods in daily conversations that include occasional concurrent speakers, recent studies incorporate both overlapped and non-overlapped segments in the training data. However, such training data can degrade the separation performance due to the triviality of non-overlapped segments, where the model simply copies the input to the output. We propose a new loss function for speaker separation based on permutation invariant training that dynamically reweights losses using the segment overlap ratio. The new loss function emphasizes overlapped regions while deemphasizing the segments with single speakers. We demonstrate the effectiveness of the proposed loss function on an automatic speech recognition (ASR) task. Experiments on the recently introduced LibriCSS corpus show that our proposed single-channel method produces consistent improvements compared to baseline methods.
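One way to picture a permutation-invariant loss that is reweighted by overlap ratio is the sketch below. The abstract does not give the weighting formula, so everything here is an assumption for illustration: the mean-squared-error criterion, the definition of overlap ratio as the fraction of samples where at least two targets are active, the linear weight, and the `floor` parameter.

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Minimum mean-squared error over all speaker permutations."""
    n = targets.shape[0]
    return min(
        float(np.mean((estimates[list(p)] - targets) ** 2))
        for p in itertools.permutations(range(n))
    )

def overlap_ratio(targets, eps=1e-8):
    """Fraction of samples where at least two targets are active (assumed definition)."""
    active = np.abs(targets) > eps  # (speakers, samples)
    return float(np.mean(active.sum(axis=0) >= 2))

def modulated_loss(estimates, targets, floor=0.1):
    """Scale the PIT loss so overlapped segments count more (hypothetical weighting)."""
    weight = floor + (1.0 - floor) * overlap_ratio(targets)
    return weight * pit_mse(estimates, targets)
```

With this weighting, a segment with a single active speaker contributes only `floor` times its PIT loss, while a fully overlapped segment contributes its full loss.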