Zhong-Qiu Wang scite author profile

This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This ignores the reconstruction error introduced by phase inconsistency. In our approach, the loss function is defined directly on the reconstructed signals, which are optimized for best separation. In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers. While mask values are typically limited to lie between zero and one for approaches that use the mixture phase for reconstruction, this limitation is less relevant if the estimated magnitudes are to be used together with phase reconstruction. We thus propose several novel activation functions for the output layer of the T-F masking network, to allow mask values beyond one. On the publicly available wsj0-2mix dataset, our approach achieves a state-of-the-art 12.6 dB scale-invariant signal-to-distortion ratio (SI-SDR) and 13.1 dB SDR, revealing new possibilities for deep-learning-based phase reconstruction and representing fundamental progress toward solving the notoriously hard cocktail party problem.
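The abstract reports results in SI-SDR without defining it. As a rough illustration only, the sketch below computes SI-SDR using the commonly cited definition: the target is rescaled by its projection coefficient onto the estimate, and the energy of that scaled target is compared with the energy of the residual. The function name `si_sdr`, the `eps` stabilizer, and the omission of mean removal are assumptions made for this example, not details taken from the paper.

```python
import math


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Hypothetical helper for illustration; `eps` is an assumed stabilizer
    that avoids division by zero and log of zero.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")

    # Optimal scaling of the target: alpha = <estimate, target> / ||target||^2
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / (target_energy + eps)

    # Energy of the scaled target versus energy of the residual error
    signal_energy = sum((alpha * t) ** 2 for t in target)
    noise_energy = sum((alpha * t - e) ** 2 for e, t in zip(estimate, target))

    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


target = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
score = si_sdr(estimate, target)                     # about 20 dB
scaled = si_sdr([2.0 * e for e in estimate], target)  # about 20 dB as well
```

Rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR. A loss defined on reconstructed signals, as the abstract describes, needs the time-domain waveform rather than STFT magnitudes, so a metric of this kind can only be evaluated after the inverse STFT.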

A Joint Training Framework for Robust Automatic Speech Recognition

Wang

2016

IEEE/ACM Trans. Audio Speech Lang. Process.

138

Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR

Wang

2020

IEEE/ACM Trans. Audio Speech Lang. Process.

165

Robust Speaker Localization Guided by Deep Learning-Based Time-Frequency Masking

Wang

Zhang

Wang

2019

IEEE/ACM Trans. Audio Speech Lang. Process.

Zhong-Qiu Wang

Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
