The problem of single-channel speech enhancement has traditionally been addressed with statistical signal processing algorithms designed to suppress time-frequency regions affected by noise. We study an alternative data-driven approach that uses deep neural networks (DNNs) to learn the transformation from noisy and reverberant speech to clean speech, with a focus on real-time applications that require low-latency causal processing. We examine several structures in which deep learning can be used within an enhancement system. These include end-to-end DNN regression from noisy to clean spectra, as well as less invasive approaches that estimate a suppression gain for each time-frequency bin instead of directly recovering the clean spectral features. We also propose a novel architecture in which the general structure of a conventional noise suppressor is preserved, but the sub-tasks are independently learned and carried out by separate networks. It is shown that DNN-based suppression gain estimation outperforms the regression approach in the causal processing mode and for noise types that are not seen during DNN training.
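As a rough illustration of the two formulations (not the paper's exact architecture), the sketch below contrasts direct spectral regression with per-bin suppression-gain estimation. The unidirectional LSTM, layer sizes, and all names are illustrative assumptions, chosen only to keep the processing causal (past-only context).

```python
# Minimal sketch contrasting direct spectral regression with per-bin
# suppression-gain estimation. Architecture details are assumptions,
# not taken from the paper.
import torch
import torch.nn as nn

class SpectralRegressor(nn.Module):
    """Maps noisy log-magnitude spectra directly to clean spectra."""
    def __init__(self, n_bins=257, hidden=512):
        super().__init__()
        self.rnn = nn.LSTM(n_bins, hidden, batch_first=True)  # causal recurrence
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, noisy):                 # noisy: (batch, frames, bins)
        h, _ = self.rnn(noisy)
        return self.out(h)                    # unconstrained clean-spectrum estimate

class GainEstimator(nn.Module):
    """Predicts a suppression gain in [0, 1] for each time-frequency bin."""
    def __init__(self, n_bins=257, hidden=512):
        super().__init__()
        self.rnn = nn.LSTM(n_bins, hidden, batch_first=True)  # causal recurrence
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, noisy_mag):             # noisy_mag: (batch, frames, bins)
        h, _ = self.rnn(noisy_mag)
        gain = torch.sigmoid(self.out(h))     # bounded per-bin gain
        return gain * noisy_mag               # enhancement = attenuated input

# Dummy magnitude spectrogram standing in for STFT features.
noisy = torch.randn(1, 100, 257).abs()
enhanced = GainEstimator()(noisy)
```

One plausible reading of the abstract's result is visible in this structure: the gain formulation can only attenuate the observed spectrum, a constraint that limits how badly the model can fail on noise types never seen in training, whereas the regressor's output is unconstrained.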
Recognition of distant (far-field) speech is a challenge for ASR due to mismatches in recording conditions caused by room reverberation and environmental noise. Given the remarkable learning capacity of deep neural networks, there is increasing interest in addressing this problem by using a large corpus of reverberant far-field speech to train robust models. In this study, we explore how an end-to-end RNN acoustic model trained on speech from different rooms and acoustic conditions (different domains) achieves robustness to environmental variations. It is shown that the first hidden layer acts as a domain separator, projecting the data from different domains into different subspaces. The subsequent layers then use this encoded domain knowledge to map these features to final representations that are invariant to domain change. This mechanism is closely related to noise-aware or room-aware approaches that append manually extracted domain signatures to the input features. Additionally, we demonstrate how this understanding of the learning procedure provides useful guidance for model adaptation to new acoustic conditions. We present results on the AMI corpus to demonstrate the propagation of domain information through a deep RNN, and perform recognition experiments that highlight the role of encoded domain knowledge in the training and adaptation of RNN acoustic models.
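The domain-separator claim can be checked with a simple diagnostic of the kind sketched below: train a linear classifier to predict the recording condition from a layer's activations, so that high accuracy at the first layer and lower accuracy at deeper layers indicates domain information is encoded early and then suppressed. This is an illustrative probe, not the paper's analysis method; the function name, the sklearn classifier, and the synthetic stand-in data are all assumptions.

```python
# Illustrative linear probe for domain separability of layer activations.
# Synthetic data stands in for per-layer RNN activations from two rooms.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def domain_probe_accuracy(activations, domain_labels):
    """activations: (frames, dims); domain_labels: (frames,) room/condition ids.
    Returns held-out accuracy of a linear classifier, a proxy for how
    linearly separable the domains are in this representation."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, domain_labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Two rooms simulated as activation clouds in shifted subspaces.
rng = np.random.default_rng(0)
layer1 = np.vstack([rng.normal(-1, 1, (500, 64)),   # room A
                    rng.normal(+1, 1, (500, 64))])  # room B
labels = np.array([0] * 500 + [1] * 500)
print("layer-1 domain probe accuracy:", domain_probe_accuracy(layer1, labels))
```

Applied to real activations, running this probe layer by layer would trace how domain information propagates through the network, which is the kind of evidence the abstract describes for the AMI experiments.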