Despite considerable effort, monaural (single-microphone) algorithms capable of increasing the intelligibility of speech in noise have remained elusive. Successful development of such an algorithm is especially important for hearing-impaired (HI) listeners, given their particular difficulty in noisy backgrounds. In the current study, an algorithm based on binary masking was developed to separate speech from noise. Unlike the ideal binary mask, which requires prior knowledge of the premixed signals, the masks used to segregate speech from noise in the current study were estimated by training the algorithm on speech not used during testing. Sentences were mixed with speech-shaped noise and with babble at various signal-to-noise ratios (SNRs). Testing using normal-hearing and HI listeners indicated that intelligibility increased following processing in all conditions. These increases were larger for HI listeners, for the modulated background, and for the least-favorable SNRs. They were also often substantial, allowing several HI listeners to improve intelligibility from scores near zero to values above 70%.
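As a point of reference for the ideal binary mask mentioned above, the following is a minimal sketch, not the study's algorithm. It assumes the premixed speech and noise are available as per-unit energies on a time-frequency grid, and that a unit is retained when its local SNR exceeds a criterion. The function names and the 0 dB default criterion are illustrative assumptions.

```python
import math

def ideal_binary_mask(speech_energy, noise_energy, criterion_db=0.0):
    """Return 1 for time-frequency units whose local SNR exceeds criterion_db, else 0.

    speech_energy and noise_energy are equally sized lists of rows of
    non-negative energies. The criterion value is an assumption here.
    """
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            if s == 0:
                keep = False          # no speech energy: discard the unit
            elif n == 0:
                keep = True           # speech with no noise: keep the unit
            else:
                keep = 10.0 * math.log10(s / n) > criterion_db
            row.append(1 if keep else 0)
        mask.append(row)
    return mask

def apply_mask(mixture_energy, mask):
    """Zero out the mixture units that the mask rejects."""
    return [[m * k for m, k in zip(m_row, k_row)]
            for m_row, k_row in zip(mixture_energy, mask)]
```

An estimated mask, as used in the study, would replace the premixed energies with the output of a trained classifier; that part is not sketched here.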
Supervised speech segregation has recently been shown to improve human speech intelligibility in noise, when trained and tested on similar noises. However, a major challenge involves the ability to generalize to entirely novel noises. Such generalization would enable hearing aid and cochlear implant users to improve speech intelligibility in unknown noisy environments. This challenge is addressed in the current study through large-scale training. Specifically, a deep neural network (DNN) was trained on 10 000 noises to estimate the ideal ratio mask, and then employed to separate sentences from completely new noises (cafeteria and babble) at several signal-to-noise ratios (SNRs). Although the DNN was trained at the fixed SNR of −2 dB, testing using hearing-impaired listeners demonstrated that speech intelligibility increased substantially following speech segregation using the novel noises and unmatched SNR conditions of 0 dB and 5 dB. Sentence intelligibility benefit was also observed for normal-hearing listeners in most noisy conditions. The results indicate that DNN-based supervised speech segregation with large-scale training is a very promising approach for generalization to new acoustic environments.
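The two quantities named above, a mixture at a fixed SNR and the ideal ratio mask, can be sketched as follows. This is an illustration under stated assumptions: the abstract does not give the mask formula, so the sketch uses one commonly cited form, (S / (S + N))^β with β = 0.5 applied to per-unit energies. The function names are hypothetical and the DNN that estimates the mask is not shown.

```python
import math

def mean_power(x):
    """Average power of a sampled signal."""
    return sum(v * v for v in x) / len(x)

def scale_noise_to_snr(speech, noise, snr_db):
    """Scale the noise so that speech-to-noise power ratio equals snr_db."""
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [gain * v for v in noise]

def ideal_ratio_mask(speech_energy, noise_energy, beta=0.5):
    """Soft mask in [0, 1] per time-frequency unit (assumed formula, see lead-in)."""
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            row.append((s / total) ** beta if total > 0 else 0.0)
        mask.append(row)
    return mask

# Example: mix at the training SNR of -2 dB mentioned in the abstract.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [2.0, 2.0, -2.0, -2.0]
scaled_noise = scale_noise_to_snr(speech, noise, -2.0)
mixture = [s + n for s, n in zip(speech, scaled_noise)]
```

Unlike a binary mask, the ratio mask attenuates each unit in proportion to its speech content instead of keeping or discarding it outright.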
Machine learning algorithms to segregate speech from background noise hold considerable promise for alleviating limitations associated with hearing impairment. One of the most important considerations for implementing these algorithms into devices such as hearing aids and cochlear implants involves their ability to generalize to conditions not employed during the training stage. A major challenge involves the generalization to novel noise segments. In the current study, sentences were segregated from multi-talker babble and from cafeteria noise using an algorithm that employs deep neural networks to estimate the ideal ratio mask. Importantly, the algorithm was trained on segments of noise and tested using entirely novel segments of the same nonstationary noise type. Substantial sentence-intelligibility benefit was observed for hearing-impaired listeners in both noise types, despite the use of unseen noise segments during the test stage. Interestingly, normal-hearing listeners displayed benefit in babble but not in cafeteria noise. This result highlights the importance of evaluating these algorithms not only in human subjects, but in members of the actual target population.
Band-importance functions were created using the "compound" technique [Apoux and Healy, J. Acoust. Soc. Am. 132, 1078-1087 (2012)] that accounts for the multitude of synergistic and redundant interactions that take place among speech bands. Functions were created for standard recordings of the speech perception in noise (SPIN) sentences and the Central Institute for the Deaf (CID) W-22 words using 21 critical-band divisions and steep filtering to eliminate the influence of filter slopes. On a given trial, a band of interest was presented along with four other bands having spectral locations determined randomly on each trial. In corresponding trials, the band of interest was absent and only the four other bands were present. The importance of the band of interest was determined by the difference between paired band-present and band-absent trials. Because the locations of the other bands changed randomly from trial to trial, various interactions occurred between the band of interest and other speech bands which provided a general estimate of band importance. Obtained band-importance functions differed substantially from those currently available for identical speech recordings. In addition to differences in the overall shape of the functions, especially for the W-22 words, a complex microstructure was observed in which the importance of adjacent frequency bands often varied considerably. This microstructure may result in better predictive power of the current functions.
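The paired-trial logic described above can be sketched roughly as below. The sketch assumes each trial pair is summarized as a band index plus a proportion-correct score with and without the band of interest. The function names, the data layout, and the normalization to a sum of 1 are illustrative assumptions, not details taken from the study.

```python
from collections import defaultdict

def band_importance(paired_trials):
    """Mean (band-present minus band-absent) score for each band.

    paired_trials: iterable of (band, present_score, absent_score) tuples,
    one per pair of corresponding trials (hypothetical layout).
    """
    differences = defaultdict(list)
    for band, present, absent in paired_trials:
        differences[band].append(present - absent)
    return {band: sum(d) / len(d) for band, d in differences.items()}

def normalize(importance):
    """Rescale importances to sum to 1 (an assumed convention)."""
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("total importance must be positive to normalize")
    return {band: value / total for band, value in importance.items()}

# Made-up scores for two bands, for illustration only.
trials = [(1, 0.6, 0.4), (1, 0.7, 0.5), (2, 0.5, 0.4)]
raw = band_importance(trials)
function = normalize(raw)
```

Because the four accompanying bands change at random from pair to pair, averaging the differences over many pairs is what lets the estimate reflect a range of band interactions.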
The present study investigated the role and relative contribution of envelope and temporal fine structure (TFS) to sentence recognition in noise. Target and masker stimuli were added at five different signal-to-noise ratios (SNRs) and filtered into 30 contiguous frequency bands. The envelope and TFS were extracted from each band by Hilbert decomposition. The final stimuli consisted of the envelope of the target/masker sound mixture at x dB SNR and the TFS of the same sound mixture at y dB SNR. A first experiment showed a very limited contribution of TFS cues, indicating that sentence recognition in noise relies almost exclusively on temporal envelope cues. A second experiment showed that replacing the carrier of a sound mixture with noise (vocoder processing) cannot be considered equivalent to disrupting the TFS of the target signal by adding a background noise. Accordingly, a re-evaluation of the vocoder approach as a model to further understand the role of TFS cues in noisy situations may be necessary. Overall, these data are consistent with the view that speech information is primarily extracted from the envelope while TFS cues are primarily used to detect glimpses of the target.
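The Hilbert decomposition step can be illustrated for a single band with the sketch below. It builds the analytic signal with a direct DFT (slow, but dependency-free), takes its magnitude as the envelope and the cosine of its phase as the TFS, and then recombines an envelope and a TFS that may come from different mixtures. The band filtering into 30 bands is omitted, and all function names are hypothetical.

```python
import cmath
import math

def analytic_signal(x):
    """Analytic signal of a real sequence via a direct O(n^2) DFT."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            weight = 1.0      # DC and Nyquist bins are kept once
        elif k < (n + 1) // 2:
            weight = 2.0      # positive frequencies are doubled
        else:
            weight = 0.0      # negative frequencies are removed
        spectrum[k] *= weight
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def envelope_and_tfs(x):
    """Split one band-limited signal into envelope and temporal fine structure."""
    z = analytic_signal(x)
    envelope = [abs(v) for v in z]
    tfs = [math.cos(cmath.phase(v)) for v in z]
    return envelope, tfs

def combine(envelope, tfs):
    """Impose an envelope on a fine structure, sample by sample."""
    return [e * f for e, f in zip(envelope, tfs)]
```

To approximate the stimuli described above, `envelope` would be taken from the band of a mixture at x dB SNR and `tfs` from the same band of a mixture at y dB SNR before calling `combine`, with the processed bands then summed.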