This article presents an overview of twelve existing objective speech quality and intelligibility prediction tools. Two classes of algorithms are covered, intrusive and non-intrusive: the former require a reference signal, while the latter do not. The investigated metrics include those developed for normal-hearing listeners as well as those tailored specifically to hearing-impaired (HI) listeners who use assistive listening devices (i.e., hearing aids (HAs) and cochlear implants (CIs)). Representative examples of metrics optimized for HI listeners include the speech-to-reverberation modulation energy ratio tailored to hearing aids (SRMR-HA) and to cochlear implants (SRMR-CI); the modulation spectrum area (ModA); the Hearing Aid Speech Quality Index (HASQI) and Hearing Aid Speech Perception Index (HASPI); and the PErception MOdel - hearing impairment quality (PEMO-Q-HI) measure. The objective metrics are tested on three subjectively rated speech datasets covering reverberation-only, noise-only, and reverberation-plus-noise degradation conditions, as well as degradations resulting from nonlinear frequency compression and different speech enhancement strategies. The advantages and limitations of each measure are highlighted, and recommendations are given for using the different tools under specific environmental and processing conditions.
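The intrusive/non-intrusive distinction comes down to the metric's interface: an intrusive metric scores a degraded signal against its clean reference, while a non-intrusive one must score the degraded signal alone. The following minimal Python sketch illustrates only that interface difference; the toy segmental-SNR-style and energy-dynamics scores are illustrative assumptions, not any of the twelve algorithms surveyed here.

```python
# Minimal sketch of the intrusive vs. non-intrusive interface distinction.
# Both toy metrics below are placeholders, not the surveyed algorithms.
import numpy as np

def intrusive_metric(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Needs the clean reference: here, a toy SNR-style score in dB."""
    noise = degraded - reference
    return 10.0 * np.log10(np.sum(reference**2) / (np.sum(noise**2) + 1e-12))

def non_intrusive_metric(degraded: np.ndarray) -> float:
    """Uses only the degraded signal plus prior knowledge of clean speech;
    here, a toy score based on frame-energy dynamics (clean speech tends
    to have a wider energy range than heavily degraded speech)."""
    frames = degraded[: len(degraded) // 256 * 256].reshape(-1, 256)
    energy_db = 10.0 * np.log10(np.mean(frames**2, axis=1) + 1e-12)
    return float(np.std(energy_db))
```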
Non-intrusive speech intelligibility metrics rely solely on the corrupted speech signal and a prior model of the speech signal in a given representation. As such, any source of variability not taken into account by the model will affect the metric's performance. In this paper, we investigate two sources of variability in the auditory-inspired model used by the speech-to-reverberation modulation energy ratio (SRMR) metric, namely speech content and pitch, and propose two updates that aim to reduce the variability caused by these sources. First, we limit the dynamic range of the energies in the modulation spectrum bands to reduce the effect of speech content and speaker variability. Second, we modify the range of the modulation filter bank to reduce the variability due to pitch. Experimental results show that the updated metric achieves higher performance and lower variability than the original SRMR when assessing speech intelligibility in noisy and reverberant environments, and outperforms several standard intrusive and non-intrusive benchmark metrics.
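To make the first update concrete, the sketch below shows a dynamic-range limit applied to per-band modulation energies before forming an SRMR-style low-to-high modulation-energy ratio. The 30 dB floor, the band split, and the function names are assumptions for illustration, not the values or implementation from the paper.

```python
# Hedged sketch of SRMR-style scoring with a dynamic-range limit on the
# modulation-band energies. Specific constants are illustrative only.
import numpy as np

def limit_dynamic_range(mod_energies: np.ndarray, range_db: float = 30.0) -> np.ndarray:
    """Clip band energies to `range_db` below the peak, reducing
    sensitivity to speech content and speaker variability."""
    floor = mod_energies.max() * 10.0 ** (-range_db / 10.0)
    return np.maximum(mod_energies, floor)

def srmr_like_ratio(mod_energies: np.ndarray, n_low: int = 4) -> float:
    """Ratio of low-modulation-band energy (speech-dominated) to
    high-modulation-band energy (reverberation-dominated)."""
    e = limit_dynamic_range(mod_energies)
    return float(e[:n_low].sum() / (e[n_low:].sum() + 1e-12))
```

The second update, restricting the modulation filter bank range, would correspond to choosing which bands enter `mod_energies` in the first place, so that pitch-driven modulation frequencies are excluded.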
Singing voice separation based on deep learning relies on time-frequency masking. In many cases, the masking process is not a learnable function or is not encapsulated in the deep learning optimization. Consequently, most existing methods rely on a post-processing step using generalized Wiener filtering. This work proposes a method that learns and optimizes a source-dependent mask during training and does not need the aforementioned post-processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. The obtained results show an increase of 0.49 dB in signal-to-distortion ratio and 0.30 dB in signal-to-interference ratio compared to previous state-of-the-art approaches for monaural singing voice separation.
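The key idea, placing the mask inside the computation graph so it is optimized during training rather than applied as fixed post-processing, can be sketched as follows. This is a minimal PyTorch illustration under assumed layer sizes and a sigmoid mask; it is not the paper's architecture (which additionally uses recurrent inference, a sparse transformation, and a learned denoising filter).

```python
# Minimal sketch: a source-dependent mask learned jointly with the
# separator. Sizes and the sigmoid parameterization are assumptions.
import torch
import torch.nn as nn

class LearnedMaskSeparator(nn.Module):
    def __init__(self, n_freq: int = 513, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, batch_first=True)
        self.to_mask = nn.Linear(hidden, n_freq)

    def forward(self, mix_mag: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, time, freq) magnitude spectrogram of the mixture
        h, _ = self.rnn(mix_mag)
        mask = torch.sigmoid(self.to_mask(h))  # source-dependent mask in [0, 1]
        # Masking happens inside the graph, so gradients from the
        # reconstruction loss flow through the mask itself.
        return mask * mix_mag
```

Because the masked output is differentiable with respect to the mask parameters, a reconstruction loss against the target vocal spectrogram trains the mask directly, removing the need for a separate Wiener-filtering stage.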
With the increasing prevalence of Autism Spectrum Disorders (ASD), very early detection has become a key research priority, as early interventions can increase the chances of success. Since atypical communication is a hallmark of ASD, automated acoustic-prosodic analyses have received prominent attention. Existing studies, however, have focused on verbal children, typically over the age of three (when many children can be reliably diagnosed) and as old as their early teens. Here, an acoustic-prosodic analysis of pre-verbal vocalizations (e.g., babbles, cries) of 18-month-old toddlers is performed. Data were obtained from a prospective longitudinal study of high-risk siblings of children with ASD who were themselves also diagnosed with ASD, as well as low-risk, age-matched, typically developing controls. Several acoustic-prosodic features were extracted and used to train support vector machine and probabilistic neural network classifiers; classification accuracy as high as 97% was obtained. Our findings suggest that markers of autism may be present in the pre-verbal vocalizations of 18-month-old toddlers and may thus be used to assist clinicians with very early detection of ASD.
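The classification step, training a support vector machine on per-vocalization acoustic-prosodic feature vectors, follows a standard pattern sketched below. The feature dimensionality, hyperparameters, and randomly generated placeholder data are assumptions for illustration; they are not the study's features, data, or reported 97% result.

```python
# Hedged sketch of SVM classification on acoustic-prosodic features.
# All data and hyperparameters below are placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))    # e.g., pitch, intensity, and duration statistics per vocalization
y = rng.integers(0, 2, size=60)  # 1 = later ASD diagnosis, 0 = typically developing

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)  # cross-validated accuracy
print(f"Mean CV accuracy: {scores.mean():.2f}")
```

A probabilistic neural network classifier, the study's other model, would slot into the same pipeline in place of the SVC.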