Highlights
Assessment of speaker similarity combining source and filter voice characteristics.
Feature selection method to determine the most parsimonious feature subset.
Testing with very similar-sounding speakers, i.e. monozygotic (MZ) twins.
Testing using high-quality and telephone-filtered recordings.
Significant differences between same-speaker and different-speaker comparisons.
Among phoneticians, the Vocal Profile Analysis (VPA) is one of the most widely used methods for the componential assessment of voice quality. Whether the ultimate goal of the VPA evaluation is the comparative description of languages or the characterization of an individual speaker, the VPA protocol shows great potential for different research areas of speech communication. However, its use is not without practical difficulties. Despite these, methodological studies aimed at explaining where, when and why issues arise during the perceptual assessment process are rare. In this paper we describe the methodological stages through which three analysts evaluated the voices of 99 Standard Southern British English male speakers, rated their voices using the VPA scheme, discussed inter-rater disagreements, and eventually produced an agreed version of VPA scores. These scores were then used to assess correlations between settings. We show that it is possible to reach a good degree of inter-rater agreement, provided that several calibration and training sessions are conducted. We further conclude that the perceptual assessment of voice quality using the VPA scheme is an essential tool in fields such as forensic phonetics but, foremost, that it can be adapted and modified to a range of research areas, and not necessarily limited to the evaluation of pathological voices in clinical settings.
Agreement results suggest that the proposed SVPA is a reliable protocol for the perceptual characterization of VQ, and SMC results confirm that it can also be a useful tool for the assessment of speaker (dis)similarity. The extraction of a voice quality similarity index shows potential in fields like forensic phonetics, but could also be of interest in related areas of voice research and professional practice.
The performance of the automatic speaker recognition (ASR) system Batvox™ (Version 4.1) has been tested with a male population of 24 monozygotic (MZ) twins, 10 dizygotic (DZ) twins, 8 non-twin siblings and 12 unrelated speakers (aged 18-52, with Standard Peninsular Spanish as their mother tongue). Since the cepstral features on which this ASR system is based depend largely on anatomical-physiological foundations, we hypothesized that such features ought to be gene-dependent. Therefore, higher similarity values should be found in MZ twins (100% shared genes) than in DZ twins, in brothers (B) or in a reference population of unrelated speakers (US). Results corroborated the expected decreasing scale MZ > DZ > B > US: the similarity coefficients yielded by the automatic system decreased in the same order as the degree of kinship across the four speaker groups. This suggests that the system features are to a great extent genetically conditioned and that they are hence useful and robust for comparing speech samples of known and unknown origin, as found in legal cases. Furthermore, the 9.9% EER (Equal Error Rate) obtained when testing MZ pairs is close to the 11% EER found in Künzel (2010) with German twins.
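The Equal Error Rate mentioned above is the operating point at which the false acceptance rate (different-speaker pairs accepted as the same speaker) equals the false rejection rate (same-speaker pairs rejected). The following is a minimal sketch of how an EER could be estimated from two lists of similarity scores; it is not Batvox's procedure, the function name `equal_error_rate` is hypothetical, and the scores in the usage example are invented for illustration rather than taken from the study.

```python
def equal_error_rate(same_scores, diff_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    same_scores: similarity scores for same-speaker comparisons.
    diff_scores: similarity scores for different-speaker comparisons.
    A pair is accepted as "same speaker" when its score >= threshold.
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    best = None
    for t in sorted(set(same_scores) | set(diff_scores)):
        # False rejection: a same-speaker pair scoring below the threshold.
        frr = sum(s < t for s in same_scores) / len(same_scores)
        # False acceptance: a different-speaker pair scoring at or above it.
        far = sum(d >= t for d in diff_scores) / len(diff_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finite data FAR and FRR rarely coincide exactly,
            # so report their mean at the closest crossing point.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Illustrative (made-up) scores, not data from the study.
same = [0.9, 0.8, 0.7, 0.4]
diff = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(same, diff)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With very similar-sounding speakers such as MZ twins, the different-speaker scores move closer to the same-speaker scores, so the two distributions overlap more and the estimated EER rises.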
In forensic voice comparison, there is increasing focus on the integration of automatic and phonetic methods to improve the validity and reliability of voice evidence to the courts. In line with this, we present a comparison of long-term measures of the speech signal to assess the extent to which they capture complementary speaker-specific information. Likelihood ratio-based testing was conducted using MFCCs and (linear and Mel-weighted) long-term formant distributions (LTFDs). Fusing automatic and semi-automatic systems yielded limited improvement in performance over the baseline MFCC system, indicating that these measures capture essentially the same speaker-specific information. The output from the best performing system was used to evaluate the contribution of auditory-based analysis of supralaryngeal (filter) and laryngeal (source) voice quality in system testing. Results suggest that the problematic speakers for the (semi-)automatic system are, to some extent, predictable from their supralaryngeal voice quality profiles, with the least distinctive speakers producing the weakest evidence and most misclassifications. However, the misclassified pairs were still easily differentiated via auditory analysis. Laryngeal voice quality may thus be useful in resolving problematic pairs for (semi-)automatic systems, potentially improving their overall performance.