Speech intelligibility models typically consist of a preprocessing part that transforms stimuli into some internal (auditory) representation and a decision metric that relates the internal representation to speech intelligibility. The present study analyzed the role of modulation filtering in the preprocessing of different speech intelligibility models by comparing predictions from models that either assume a spectro-temporal (i.e., two-dimensional) or a temporal-only (i.e., one-dimensional) modulation filterbank. Furthermore, the role of the decision metric for speech intelligibility was investigated by comparing predictions from models based on the signal-to-noise envelope power ratio, SNRenv, and the modulation transfer function, MTF. The models were evaluated in conditions of noisy speech (1) subjected to reverberation, (2) distorted by phase jitter, or (3) processed by noise reduction via spectral subtraction. The results suggested that a decision metric based on the SNRenv may provide a more general basis for predicting speech intelligibility than a metric based on the MTF. Moreover, the one-dimensional modulation filtering process was found to be sufficient to account for the data when combined with a measure of across (audio) frequency variability at the output of the auditory preprocessing. A complex spectro-temporal modulation filterbank might therefore not be required for speech intelligibility prediction.
This study proposes a binaural extension to the multi-resolution speech-based envelope power spectrum model (mr-sEPSM) [Jørgensen, Ewert, and Dau (2013). J. Acoust. Soc. Am. 134, 436-446]. It consists of a combination of better-ear (BE) and binaural unmasking processes, implemented as two monaural realizations of the mr-sEPSM combined with a short-term equalization-cancellation process, and uses the signal-to-noise ratio in the envelope domain (SNRenv) as the decision metric. The model requires only two parameters to be fitted per speech material and does not require an explicit frequency weighting. The model was validated against three data sets from the literature, which covered the following effects: the number of maskers, the masker types [speech-shaped noise (SSN), speech-modulated SSN, babble, and reversed speech], the masker(s) azimuths, reverberation on the target and masker, and the interaural time difference of the target and masker. The Pearson correlation coefficient between the simulated speech reception thresholds and the data across all experiments was 0.91. A model version that considered only BE processing performed similarly (correlation coefficient of 0.86) to the complete model, suggesting that BE processing could be considered sufficient to predict intelligibility in most realistic conditions.
Intelligibility models provide insights regarding the effects of target speech characteristics, transmission channels and/or auditory processing on the speech perception performance of listeners. In 2011, Jørgensen and Dau proposed the speech-based envelope power spectrum model [sEPSM, Jørgensen and Dau (2011). J. Acoust. Soc. Am. 130(3), 1475-1487]. It uses the signal-to-noise ratio in the modulation domain (SNRenv) as a decision metric and was shown to accurately predict the intelligibility of processed noisy speech. The sEPSM concept has since been applied in various subsequent models, which have extended the predictive power of the original model to a broad range of conditions. This contribution presents the most recent developments within the sEPSM “family:” (i) A binaural extension, the B-sEPSM [Chabot-Leclerc et al. (2016). J. Acoust. Soc. Am. 140(1), 192-205] which combines better-ear and binaural unmasking processes and accounts for a large variety of spatial phenomena in speech perception; (ii) a correlation-based version [Relaño-Iborra et al. (2016). J. Acoust. Soc. Am. 140(4), 2670-2679] which extends the predictions of the early model to non-linear distortions, such as phase jitter and binary mask-processing; and (iii) a recent physiologically inspired extension, which allows to functionally account for effects of individual hearing impairment on speech perception.
Speech intelligibility models consist of a preprocessing part that transforms the stimuli into some internal (auditory) representation, and a decision metric that quantifies effects of transmission channel, speech interferers, and auditory processing on the speech intelligibility. Here, two recent speech intelligibility models, the spectro-temporal modulation index (STMI; Elhilali et al., 2003) and the speech-based envelope power spectrum model (sEPSM; Jørgensen and Dau, 2011) were evaluated in conditions of noisy speech subjected to reverberation, and to nonlinear distortions through either a phase jitter process or noise reduction via spectral subtraction. The contributions of the individual preprocessing stages in the models and the role of the decision metrics were analyzed in the different experimental conditions. It is demonstrated that an explicit across-frequency envelope processing stage, as assumed in the STMI, together with the metric based on the envelope power signal-to-noise ratio, as assumed in the sEPSM, are required to account for all three conditions. However, a simple across audio-frequency mechanism combined with a purely temporal modulation filterbank is assumed to be sufficient to describe the data, i.e., a joint two-dimensional modulation filterbank might not be required.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.