The learning-based speech recovery approach using statistical spectral conversion has been applied to certain kinds of distorted speech, such as alaryngeal speech and body-conducted (bone-conducted) speech. This approach attempts to recover clean (undistorted) speech from noisy (distorted) speech by converting the statistical models of noisy speech into those of clean speech, without prior knowledge of the characteristics and distribution of the noise source. To date, this approach has attracted few researchers in general noisy-speech enhancement, mainly because of two problems: the difficulty of noise adaptation and the lack of noise-robust synthesizable features across different noisy environments. In this paper, we adapt state-of-the-art methods from voice conversion and from speaker adaptation in speech recognition to the proposed speech-recovery approach, applying it to different kinds of noisy environments, especially adverse environments requiring joint compensation of additive and convolutive noise. We propose decorrelated wavelet packet coefficients as a low-dimensional, noise-robust synthesizable feature under noisy environments. We also propose eigennoise-based noise adaptation for speech recovery, analogous to the eigenvoice in voice conversion. The experimental results show that the proposed approach substantially outperforms traditional non-learning-based approaches.
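The abstract names decorrelated wavelet packet coefficients as the proposed feature but does not specify the wavelet, decomposition depth, or decorrelation step. As a rough illustration only, the sketch below performs a minimal full wavelet packet decomposition with the Haar filter; the filter choice and depth here are assumptions, not the authors' configuration.

```python
import numpy as np

def haar_step(x):
    # One Haar analysis step: orthonormal averaging (approximation)
    # and differencing (detail) of adjacent sample pairs.
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def wavelet_packet(x, depth):
    # Full wavelet packet tree: unlike the plain wavelet transform,
    # BOTH the approximation and the detail branches are split at
    # every level, giving 2**depth equal-bandwidth subbands.
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(depth):
        nxt = []
        for n in nodes:
            a, d = haar_step(n)
            nxt.extend([a, d])
        nodes = nxt
    return nodes

# Toy "signal" of 8 samples, decomposed to depth 3.
x = np.arange(8, dtype=float)
bands = wavelet_packet(x, 3)
```

Because the Haar filter bank is orthonormal, the total energy of the subband coefficients equals the energy of the input signal, which is the property that makes such coefficients usable as an analyzable and resynthesizable feature.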
Abstract-Present noisy-speech enhancement algorithms handle additive noise effectively but perform poorly on convolutive noise such as reverberation. Even for additive noise, when only a single microphone is available, noise estimation relies on the assumption of a slowly varying, commonly stationary, noise environment; real noise, however, is non-stationary and difficult to estimate efficiently. Spectral conversion can instead predict the vocal tract (spectral envelope) parameters of clean speech from noisy speech without estimating the parameters of the noise source. It can therefore serve as a general speech enhancement model for both stationary and non-stationary additive noise environments, as well as convolutive noise environments, when only one microphone source is provided. In this paper, we propose a spectral-conversion-based speech enhancement method. The experimental results show that our method outperforms traditional methods.
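Spectral conversion of this kind is classically realized with a joint Gaussian mixture model over paired noisy/clean spectral features, converting a noisy feature to the posterior-weighted sum of per-component conditional means. The sketch below is a toy one-dimensional illustration with hand-picked GMM parameters; it shows the conversion formula only and is not the paper's actual model or feature set.

```python
import numpy as np

# Hypothetical 2-component joint GMM over z = [x, y], where x is a
# noisy spectral feature and y the corresponding clean feature.
# All parameters below are invented for illustration.
weights = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0],          # [mu_x, mu_y] of component 0
               [3.0, 6.0]])         # [mu_x, mu_y] of component 1
cov = np.array([[[1.0, 0.8],
                 [0.8, 1.0]],       # joint covariance of component 0
                [[1.0, 0.9],
                 [0.9, 1.0]]])      # joint covariance of component 1

def gauss_pdf(x, m, s2):
    # Univariate Gaussian density N(x; m, s2).
    return np.exp(-0.5 * (x - m) ** 2 / s2) / np.sqrt(2.0 * np.pi * s2)

def convert(x):
    # Posterior over mixture components given the noisy feature alone.
    lik = np.array([gauss_pdf(x, mu[k, 0], cov[k, 0, 0]) for k in range(2)])
    post = weights * lik
    post /= post.sum()
    # Minimum mean-square-error estimate: posterior-weighted mixture of
    # per-component conditional means E[y | x, k].
    est = 0.0
    for k in range(2):
        est += post[k] * (mu[k, 1]
                          + cov[k, 1, 0] / cov[k, 0, 0] * (x - mu[k, 0]))
    return est

clean_estimate = convert(3.0)
```

The key point matching the abstract: nothing in `convert` models the noise source itself; the mapping from noisy to clean features is learned entirely from paired training statistics, so it applies unchanged to non-stationary additive noise and to convolutive distortion.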