This paper presents a system for joint dereverberation and noise reduction that combines a beamformer with a single-channel spectral enhancement scheme. First, a minimum variance distortionless response (MVDR) beamformer with an online estimated noise coherence matrix is used to suppress noise and reverberation. The output of this beamformer is then processed by a single-channel spectral enhancement scheme, based on statistical room acoustics, minimum statistics, and temporal cepstrum smoothing, to suppress residual noise and reverberation. The evaluation is conducted using the REVERB challenge corpus, which is designed to evaluate speech enhancement algorithms in the presence of both reverberation and noise. The proposed system is evaluated using instrumental speech quality measures, the performance of an automatic speech recognition system, and a subjective evaluation of speech quality based on a MUSHRA test. The performances achieved by beamforming, single-channel spectral enhancement, and their combination are compared, and experimental results show that the proposed system effectively suppresses both reverberation and noise while improving speech quality. The achieved improvements are particularly significant in conditions with high reverberation times.
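The MVDR beamformer named in the abstract can be illustrated per frequency bin. The following is a minimal NumPy sketch, not the paper's implementation: the function name, the assumed-known steering vector `d`, and the toy identity coherence matrix are all illustrative choices (the paper estimates the noise coherence matrix online).

```python
import numpy as np

def mvdr_weights(d, Gamma):
    """MVDR beamformer weights for one frequency bin (illustrative sketch).

    d     : (M,) steering / relative transfer function vector (assumed known here)
    Gamma : (M, M) noise coherence matrix (estimated online in the paper)
    """
    Gi_d = np.linalg.solve(Gamma, d)       # Gamma^{-1} d without explicit inverse
    return Gi_d / (d.conj() @ Gi_d)        # enforces the distortionless constraint w^H d = 1

# Toy check: with an identity coherence matrix, MVDR reduces to delay-and-sum.
M = 4
d = np.ones(M, dtype=complex)
w = mvdr_weights(d, np.eye(M))
```

With spatially white noise (identity coherence), the weights become uniform averaging, which is the expected delay-and-sum limit of MVDR.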
Many speech dereverberation techniques require an estimate of the late reverberation power spectral density (PSD). State-of-the-art multi-channel methods for estimating the late reverberation PSD typically rely on 1) an estimate of the relative transfer functions (RTFs) of the target signal, 2) a model for the spatial coherence matrix of the late reverberation, and 3) an estimate of the reverberant speech or reverberant and noisy speech PSD matrix. The RTFs, the spatial coherence matrix, and the speech PSD matrix are all prone to modeling and estimation errors in practice, with the RTFs being particularly difficult to estimate accurately, especially in highly reverberant and noisy scenarios. Recently, we proposed an eigenvalue decomposition (EVD)-based late reverberation PSD estimator which does not require an estimate of the RTFs. In this paper, this EVD-based PSD estimator is further analyzed and its estimation accuracy and computational complexity are analytically compared to a state-of-the-art maximum likelihood (ML)-based PSD estimator. It is shown that for perfect knowledge of the RTFs, spatial coherence matrix, and reverberant speech PSD matrix, the ML-based and EVD-based PSD estimates are both equal to the true late reverberation PSD. In addition, it is shown that for erroneous RTFs but perfect knowledge of the spatial coherence matrix and reverberant speech PSD matrix, the ML-based PSD estimate is larger than or equal to the true late reverberation PSD, whereas the EVD-based PSD estimate is by construction still equal to the true late reverberation PSD. Finally, it is shown that when modeling and estimation errors occur in all quantities, the ML-based PSD estimate is larger than or equal to the EVD-based PSD estimate. Simulation results for several realistic acoustic scenarios demonstrate the advantages of using the EVD-based PSD estimator in a multi-channel Wiener filter, yielding a significantly better performance than the ML-based PSD estimator.
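The RTF-free nature of an EVD-based estimator can be sketched under the common rank-1 signal model. This is an illustrative NumPy sketch, not the paper's exact estimator: it assumes the model Phi_x = phi_s d d^H + phi_r Gamma, in which case prewhitening by the coherence matrix Gamma makes the M-1 smallest eigenvalues all equal to phi_r, so averaging them recovers the late reverberation PSD without knowing the RTF vector d.

```python
import numpy as np

def evd_psd(Phi_x, Gamma):
    """EVD-based late reverberation PSD estimate for one time-frequency bin.

    Assumed model: Phi_x = phi_s * d d^H + phi_r * Gamma (rank-1 speech term).
    Prewhitening with Gamma maps the reverberation term to phi_r * I, so the
    M-1 smallest eigenvalues equal phi_r -- no RTF estimate d is required.
    """
    L = np.linalg.cholesky(Gamma)
    Li = np.linalg.inv(L)
    Phi_w = Li @ Phi_x @ Li.conj().T         # prewhitened PSD matrix
    ev = np.linalg.eigvalsh(Phi_w)           # real eigenvalues, ascending order
    return ev[:-1].mean()                    # discard the single speech eigenvalue

# Toy check: build Phi_x from known phi_s, phi_r and recover phi_r exactly.
M, phi_s, phi_r = 4, 2.0, 0.3
d = np.ones(M, dtype=complex)
Gamma = np.eye(M)
Phi_x = phi_s * np.outer(d, d.conj()) + phi_r * Gamma
```

Under perfect model knowledge the estimate is exact, matching the abstract's statement that the EVD-based estimate equals the true PSD even without RTF knowledge.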
To assist the clinical diagnosis and treatment of neurological diseases that cause speech dysarthria such as Parkinson's disease (PD), it is of paramount importance to craft robust features which can be used to automatically discriminate between healthy and dysarthric speech. Since dysarthric speech of patients suffering from PD is breathy, semi-whispery, and is characterized by abnormal pauses and imprecise articulation, it can be expected that its spectro-temporal sparsity differs from the spectro-temporal sparsity of healthy speech. While we have recently successfully used temporal sparsity characterization for dysarthric speech detection, characterizing spectral sparsity poses the challenge of constructing a valid feature vector from signals with a different number of unaligned time frames. Further, although several non-parametric and parametric measures of sparsity exist, it is unknown which sparsity measure yields the best performance in the context of dysarthric speech detection. The objective of this paper is to demonstrate the advantages of spectro-temporal sparsity characterization for automatic dysarthric speech detection. To this end, we first provide a numerical analysis of the suitability of different non-parametric and parametric measures (i.e., ℓ1-norm, kurtosis, Shannon entropy, Gini index, shape parameter of a Chi distribution, and shape parameter of a Weibull distribution) for sparsity characterization. It is shown that kurtosis, the Gini index, and the parametric sparsity measures are advantageous sparsity measures, whereas the ℓ1-norm and entropy measures fail to robustly characterize the temporal sparsity of signals with a different number of time frames. Second, we propose to characterize the spectral sparsity of an utterance by initially time-aligning it to the same utterance uttered by an (arbitrarily selected) reference speaker using dynamic time warping.
Experimental results on a Spanish database of healthy and dysarthric speech show that estimating the spectro-temporal sparsity using the Gini index or the parametric sparsity measures and using it as a feature in a support vector machine results in a high classification accuracy of 83.3%.