Voice conversion, the methodology of automatically converting one's utterances to sound as if spoken by another speaker, poses a threat to applications that rely on speaker verification. We study the vulnerability of text-independent speaker verification systems to voice conversion attacks using telephone speech. We implemented a voice conversion system with two types of features and nonparallel frame alignment methods, together with five speaker verification systems ranging from simple Gaussian mixture models (GMMs) to a state-of-the-art joint factor analysis (JFA) recognizer. Experiments on a subset of the NIST 2006 SRE corpus indicate that the JFA method is the most resilient against conversion attacks, yet even it suffers more than a 5-fold increase in false acceptance rate, from 3.24% to 17.33%.
Abstract: In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension of the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We first provide a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus.
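To make the averaging idea concrete, here is a minimal sketch of a multitaper power-spectrum estimate for a single speech frame, assuming NumPy/SciPy are available; the taper count and time-bandwidth product below are illustrative choices, not the settings used in the paper.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_power_spectrum(frame, n_tapers=6, nfft=512):
    """Average the periodograms obtained with K orthogonal Slepian (DPSS) tapers.

    Averaging over tapers lowers the variance of the spectrum estimate
    relative to a single windowed DFT.
    """
    n = len(frame)
    # DPSS tapers; NW = (K + 1) / 2 is a common rule of thumb (assumed here).
    tapers = dpss(n, NW=(n_tapers + 1) / 2.0, Kmax=n_tapers)   # shape (K, n)
    spectra = np.abs(np.fft.rfft(tapers * frame, n=nfft, axis=1)) ** 2
    return spectra.mean(axis=0)  # uniform weights; eigenvalue weighting is also common
```

The resulting low-variance spectrum estimate would then pass through the usual mel filterbank, log, and DCT steps to yield MFCCs, exactly as with a single-window DFT.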
Figure 1: A Gaussian mixture model (GMM) with a universal background model (UBM). User-dependent models are adapted from the UBM, and the recognition score is normalized using the UBM score.
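As a rough illustration of the pipeline in Figure 1, the sketch below uses scikit-learn Gaussian mixtures: a UBM fitted on pooled background features, mean-only MAP adaptation towards one user's enrolment data, and a UBM-normalized log-likelihood-ratio score. The component count and relevance factor are illustrative assumptions, not values from the text.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

# UBM: a GMM fitted on pooled background-speaker features (e.g. MFCC frames).
# ubm = GaussianMixture(n_components=512, covariance_type='diag').fit(background_feats)

def map_adapt_means(ubm, enrol_feats, relevance=16.0):
    """Mean-only MAP adaptation of a fitted UBM towards one user's enrolment data."""
    post = ubm.predict_proba(enrol_feats)            # (T, C) component responsibilities
    n_c = post.sum(axis=0)                           # soft occupation counts
    f_c = post.T @ enrol_feats                       # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]       # data-dependent adaptation weights
    adapted = copy.deepcopy(ubm)
    adapted.means_ = alpha * (f_c / np.maximum(n_c, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return adapted

def llr_score(test_feats, user_gmm, ubm):
    """Recognition score: user-model log-likelihood normalized by the UBM score."""
    return user_gmm.score(test_feats) - ubm.score(test_feats)
```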
Abstract: We propose a person authentication system using eye movement signals. In security scenarios, eye tracking has previously been used for gaze-based password entry. A few authors have also used physical features of eye movement signals for authentication in a task-dependent scenario with matched training and test samples. We propose and implement a task-independent scenario in which the training and test samples can be arbitrary. We use short-term eye gaze direction to construct feature vectors, which are modeled using Gaussian mixtures. The results suggest that there are person-specific features in eye movements that can be modeled in a task-independent manner. The range of possible applications extends beyond security-type authentication to proactive and user-convenience systems.
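The abstract does not spell out the exact feature definition, so the following is only a hedged sketch of how short-term gaze direction features might be formed from raw (x, y) eye-tracker samples before Gaussian mixture modelling; the (angle, speed) parametrization and the window length are assumptions made for illustration.

```python
import numpy as np

def gaze_direction_features(gaze_xy, win=10):
    """gaze_xy: (T, 2) array of gaze coordinates from an eye tracker."""
    diffs = np.diff(gaze_xy, axis=0)                 # sample-to-sample displacement
    angles = np.arctan2(diffs[:, 1], diffs[:, 0])    # movement direction
    speeds = np.linalg.norm(diffs, axis=1)           # movement magnitude
    feats = np.stack([angles, speeds], axis=1)
    # Group 'win' consecutive samples into one short-term feature vector.
    n = (len(feats) // win) * win
    return feats[:n].reshape(-1, 2 * win)
```

Per-person GMMs trained on such vectors could then be scored against a background model, analogously to the GMM-UBM setup sketched above.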
State-of-the-art speaker verification systems consist of a number of complementary subsystems whose outputs are fused to arrive at a more accurate and reliable verification decision. In speaker verification, fusion is typically implemented as a linear combination of the subsystem scores. The parameters of the linear model are commonly estimated using logistic regression, as implemented in the popular FoCal toolkit. In this paper, we study the simultaneous use of classifier selection and fusion. We examine four alternative fusion strategies and three score warping techniques, and provide experimental bounds on optimal classifier subset selection. Detailed experiments are carried out on the NIST 2008 and 2010 SRE corpora.
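As a minimal sketch of this kind of fusion (the same idea the FoCal toolkit implements), the code below learns a linear combination of subsystem scores with scikit-learn's logistic regression; the score matrix and trial labels are placeholders, not data from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels):
    """dev_scores: (n_trials, n_subsystems) score matrix; dev_labels: 1 = target, 0 = impostor."""
    fuser = LogisticRegression()
    fuser.fit(dev_scores, dev_labels)
    return fuser

def fused_score(fuser, scores):
    # Final detection score: w^T s + b, the logit of the target posterior
    # under the logistic model.
    return scores @ fuser.coef_.ravel() + fuser.intercept_[0]
```

Classifier subset selection would then amount to fitting and evaluating such a fuser over candidate subsets of the available subsystems.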