Temporal modulation processing is a promising technique for improving the intelligibility and quality of speech in noise. We propose a speech enhancement algorithm that reconstructs the temporal envelope (TEV) in the time-frequency domain by means of an embedded convolutional neural network (CNN). To accomplish this, the input speech signal is divided into sixteen parallel frequency bands (subbands), each with a bandwidth approximately 1.5 times that of an auditory filter. The noise-corrupted TEV in each subband is extracted and fed to a 1-dimensional CNN (1-D CNN) model, which restores the envelope distorted by noise. The method is evaluated using 2,700 words from nine different talkers, mixed with speech-spectrum-shaped random noise (SSN) and babble noise at different signal-to-noise ratios. The Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) metrics are used to evaluate the performance of the 1-D CNN algorithm. Results suggest that, compared to a conventional TEV-based speech enhancement algorithm, the 1-D CNN model improves STOI scores by 27% for SSN and 34% for babble noise, and PESQ scores by 19% and 17%, respectively.
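The subband TEV extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `band_edges` and `subband_envelopes`, the log-spaced band edges between 80 Hz and 7 kHz, and the FFT-based band selection with a Hilbert (analytic-signal) envelope are all hypothetical stand-ins for the auditory-filter-based filterbank of the paper, whose exact design is not given here.

```python
import numpy as np


def band_edges(n_bands=16, f_lo=80.0, f_hi=7000.0):
    """Log-spaced band edges in Hz (an assumed spacing, not the paper's)."""
    return np.geomspace(f_lo, f_hi, n_bands + 1)


def subband_envelopes(x, fs, edges):
    """Return an (n_bands, len(x)) array of temporal envelopes.

    Each band is isolated in the frequency domain, and its envelope is the
    magnitude of the analytic signal (Hilbert envelope) of that band.
    Assumes every edge lies strictly between 0 Hz and fs / 2.
    """
    n = len(x)
    spectrum = np.fft.fft(x)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    envelopes = np.empty((len(edges) - 1, n))
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Keep only positive-frequency bins inside the band and double them:
        # the inverse FFT is then the analytic signal of the subband.
        mask = (freqs >= lo) & (freqs < hi)
        analytic = np.fft.ifft(2.0 * spectrum * mask)
        envelopes[k] = np.abs(analytic)
    return envelopes


# Usage: a 1.1 kHz tone with a 4 Hz amplitude modulation, sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = (1.0 + 0.5 * np.cos(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 1100 * t)
tev = subband_envelopes(x, fs, band_edges())  # shape: (16, 16000)
```

In a pipeline of the kind the abstract describes, each row of `tev` would be the noisy envelope passed to the 1-D CNN, and the restored envelopes would then be used to resynthesize the enhanced speech; those stages are not sketched here.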