During the last decades, automatic speech processing systems have made significant progress and achieved remarkable reliability. As a result, such technologies have been exploited in new areas and applications, including medical practice. In the context of disordered speech evaluation, perceptual evaluation is still the most common method used in clinical practice for diagnosing patients and monitoring the progression of their condition, despite its well-documented limitations (such as subjectivity). In this paper, we propose an automatic approach for predicting dysarthric speech evaluation metrics (intelligibility, severity, articulation impairment) based on representing the speech acoustics in the total variability subspace defined by the i-vector paradigm. The proposed approach, evaluated on 129 French dysarthric speakers from the DesPhoAPady and VML databases, proves effective for modeling patients' productions and capable of detecting the evolution of speech quality. Moreover, low RMSE and high correlation are obtained between the automatically predicted metrics and perceptual evaluations.
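The abstract does not specify the regression model used to map i-vectors to perceptual scores; as a purely hypothetical illustration of the general setup, the sketch below fits a ridge regression from fixed-dimensional speaker representations to a continuous evaluation metric. All names, dimensions, and values are invented for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam*I)^(-1) X'y.

    X: (n_speakers, dim) matrix of i-vector-like representations.
    y: (n_speakers,) vector of perceptual scores (e.g. intelligibility).
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy 2-D example with made-up data (real i-vectors are typically ~400-D
# and the scores would come from clinicians' perceptual ratings).
X = np.eye(2)
y = np.array([2.0, 4.0])
w = ridge_fit(X, y, lam=1.0)
pred = X @ w  # predicted scores for the training speakers
```

In a real system one would also evaluate RMSE and correlation between `pred` and held-out perceptual scores, as the abstract reports.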
The performance of state-of-the-art speaker recognition systems degrades considerably in noisy environments, even though they achieve very good results in clean conditions. To deal with this strong limitation, we aim in this work to remove the noisy part of an i-vector directly in the i-vector space. Our approach offers the advantage of operating only at the i-vector extraction level, leaving the other steps of the system unchanged. A maximum a posteriori (MAP) procedure is applied to obtain clean versions of the noisy i-vectors, taking advantage of prior knowledge about the clean i-vector distribution. To perform this MAP estimation, Gaussian assumptions are made over the clean and noise i-vector distributions. Operating on NIST 2008 data, we show a relative improvement of up to 60% compared with the baseline system. Our approach also outperforms the "multi-style" backend training technique. The efficiency of the proposed method comes at the price of a relatively high computational cost; we conclude with some ideas for improving this aspect.
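Under the stated Gaussian assumptions, the MAP estimate of the clean i-vector given its noisy version has a standard closed form (the posterior mean of a jointly Gaussian model). The sketch below is a minimal illustration of that formula, not the authors' implementation; the statistics and dimensions are made up for the example.

```python
import numpy as np

def map_denoise(y, mu_x, Sigma_x, mu_n, Sigma_n):
    """MAP (posterior-mean) estimate of the clean i-vector x given y = x + n,
    assuming x ~ N(mu_x, Sigma_x) and independent noise n ~ N(mu_n, Sigma_n)."""
    S = Sigma_x + Sigma_n                # covariance of the noisy i-vector y
    K = Sigma_x @ np.linalg.inv(S)       # Wiener-style gain matrix
    return mu_x + K @ (y - mu_x - mu_n)

# Toy 2-D example (real i-vectors are ~400-D; covariances would be
# estimated on clean development data and on noise samples).
mu_x = np.zeros(2)
Sigma_x = np.eye(2)
mu_n = np.array([0.5, -0.5])
Sigma_n = 4.0 * np.eye(2)                # noise much stronger than the prior
y = np.array([3.0, 2.0])

x_hat = map_denoise(y, mu_x, Sigma_x, mu_n, Sigma_n)
```

Note how the strong noise covariance shrinks the estimate heavily toward the clean prior mean, which is the intended behavior of the MAP correction.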
Forensic Voice Comparison (FVC) increasingly uses the likelihood ratio (LR) to indicate whether the evidence supports the prosecution (same-speaker) or defense (different-speakers) hypothesis. Nevertheless, the LR suffers from practical limitations, due both to its estimation process itself and to a lack of knowledge about the reliability of this (practical) estimation process. This is particularly true when FVC is performed using Automatic Speaker Recognition (ASR) systems. Indeed, the LR estimation performed by ASR systems does not consider several factors, such as the speaker's intrinsic characteristics (denoted the "speaker factor"), the amount of information involved in the comparison, and the phonological content. This article focuses on the impact of phonological content on FVC involving two different speakers, and more precisely on the potential implication of a specific phonemic category in wrongful-conviction cases (where innocents are sent behind bars). We show that even though the vast majority of speaker pairs (more than 90%) are well discriminated, a few pairs are difficult to distinguish. For the "best" discriminated pairs, all of the phonemic content plays a positive role in speaker discrimination, while for the "worst" pairs, nasals appear to have a negative effect and lead to confusion between speakers.
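For readers unfamiliar with the LR framework, a common simplification is to model the comparison score under each hypothesis with a Gaussian and take the ratio of the two densities. The sketch below illustrates only that generic idea; the parameter values are invented and this is not the calibration used in the paper.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio(score, mu_ss, sig_ss, mu_ds, sig_ds):
    """LR = p(score | same speaker) / p(score | different speakers)."""
    return normal_pdf(score, mu_ss, sig_ss) / normal_pdf(score, mu_ds, sig_ds)

# Toy example: a comparison score of 2.0, with hypothetical score
# distributions N(2.5, 1) for same-speaker and N(-1, 1) for different-speakers.
lr = likelihood_ratio(2.0, mu_ss=2.5, sig_ss=1.0, mu_ds=-1.0, sig_ds=1.0)
# lr > 1 means the evidence supports the same-speaker hypothesis.
```

The paper's point is that this LR can be distorted for particular speaker pairs by specific phonemic content (e.g. nasals), which is invisible in a single pooled score model like this one.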
Since the i-vector paradigm was introduced in the field of speaker recognition, many techniques have been proposed to deal with additive noise within this framework. Due to the complexity of its effect in the i-vector space, much of this effort has targeted other domains (speech enhancement, feature compensation, robust i-vector extraction, and robust scoring). As far as we know, there has been no serious attempt to handle the noise problem directly in the i-vector space without relying on data distributions computed in a prior domain. The aim of this paper is twofold. First, it proposes a full-covariance Gaussian modeling of the clean i-vectors and of the noise distribution in the i-vector space, and introduces a technique to estimate a clean i-vector given its noisy version and the noise density function, using the MAP approach. Based on NIST data, we show that this improves the baseline system performance by up to 60%. Second, to make this algorithm usable in a real application and reduce the computational time needed by i-MAP, we propose an extension that builds a noise distribution database in the i-vector space in an off-line step and uses it later in the test phase. We show that comparable results can be achieved with this approach (up to 57% relative EER improvement), given a sufficiently large noise distribution database.
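The off-line extension can be pictured as a lookup: precompute Gaussian statistics for many noise conditions, then at test time select the stored distribution closest to an estimate of the current noise. The sketch below is a hypothetical illustration of that selection step only (nearest stored mean in Euclidean distance); the actual matching criterion and database construction in the paper may differ.

```python
import numpy as np

def nearest_noise_stats(noise_estimate, noise_db):
    """Pick the precomputed noise distribution whose mean is closest to
    the current noise estimate.

    noise_db: list of (mu_n, Sigma_n) pairs built off-line, one per
    noise condition, in the i-vector space.
    """
    dists = [np.linalg.norm(noise_estimate - mu) for mu, _ in noise_db]
    return noise_db[int(np.argmin(dists))]

# Toy database with two made-up noise conditions in a 2-D space.
noise_db = [
    (np.array([0.0, 0.0]), np.eye(2)),   # e.g. a quiet-office condition
    (np.array([5.0, 5.0]), np.eye(2)),   # e.g. a babble-noise condition
]
mu_n, Sigma_n = nearest_noise_stats(np.array([4.5, 5.2]), noise_db)
```

The selected `(mu_n, Sigma_n)` would then be plugged into the MAP denoising formula at test time, avoiding any per-utterance estimation of the noise distribution.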
Speaker recognition with short utterances is highly challenging. The use of i-vectors in speaker recognition systems has become a standard in recent years, and many algorithms have been developed to deal with the short-utterance problem. In this paper we present a new technique based on jointly modeling the i-vectors corresponding to short utterances and those of long utterances. The joint distribution is estimated using a large number of i-vector pairs (from short and long utterances) corresponding to the same session. The obtained distribution is then integrated into an MMSE estimator at test time to compute an "improved" version of the short-utterance i-vectors. We show that this technique can be used to deal with duration mismatch and that it achieves up to 40% relative improvement in EER on NIST data. We also apply this technique to the recently published SITW database and show that it yields a 25% EER improvement compared to regular PLDA scoring.
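If the stacked pair [short; long] is modeled as jointly Gaussian, the MMSE estimate of the long-utterance i-vector given the short one is the Gaussian conditional mean. The sketch below illustrates that standard formula under this assumption; the statistics are made up, and the paper's exact parameterization may differ.

```python
import numpy as np

def mmse_long_from_short(s, mu_s, mu_l, Sigma_ss, Sigma_ls):
    """Conditional mean E[l | s] for a jointly Gaussian pair (s, l):
    E[l | s] = mu_l + Sigma_ls Sigma_ss^(-1) (s - mu_s).

    s: observed short-utterance i-vector.
    Sigma_ss: covariance of the short i-vectors.
    Sigma_ls: cross-covariance between long and short i-vectors,
              estimated from same-session (short, long) pairs.
    """
    return mu_l + Sigma_ls @ np.linalg.inv(Sigma_ss) @ (s - mu_s)

# Toy 2-D example with invented statistics.
mu_s = np.zeros(2)
mu_l = np.ones(2)
Sigma_ss = np.eye(2)
Sigma_ls = 0.5 * np.eye(2)   # short and long i-vectors positively correlated
s = np.array([2.0, 0.0])

l_hat = mmse_long_from_short(s, mu_s, mu_l, Sigma_ss, Sigma_ls)
```

The "improved" i-vector `l_hat` would then replace the raw short-utterance i-vector before PLDA scoring, reducing the duration mismatch between enrollment and test.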