Abstract-Support vector machines (SVMs), and kernel classifiers in general, rely on kernel functions to measure the pairwise similarity between inputs. This paper advocates a discrete representation of speech signals, in terms of the probabilities of discrete events, as the feature for speaker verification, and proposes the Bhattacharyya coefficient as the similarity measure for this type of input to the SVM. We analyze the effectiveness of the Bhattacharyya measure from the perspective of feature normalization and distribution warping in the SVM feature space. Experiments conducted on the NIST 2006 speaker verification task indicate that the Bhattacharyya measure outperforms the Fisher kernel, term frequency log-likelihood ratio (TFLLR) scaling, and rank normalization reported earlier in the literature. Moreover, the Bhattacharyya measure is computed using a data-independent square-root operation instead of data-driven normalization, which simplifies the implementation. The effectiveness of the Bhattacharyya measure becomes more apparent when channel compensation is applied at the model and score levels. The performance of the proposed method is within a small margin of that of the popular GMM supervector.

Index Terms-Bhattacharyya coefficient, speaker verification, support vector machine, supervector.
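
As an illustration of the data-independent square-root operation mentioned in the abstract, the following minimal sketch computes the standard Bhattacharyya coefficient between two discrete probability distributions, and checks that it equals the inner product of their element-wise square roots. The function names and the toy distributions are hypothetical and chosen only for illustration; this is not the authors' implementation, and it does not show how the event probabilities are estimated from speech.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Bhattacharyya coefficient: sum over events of sqrt(p_i * q_i)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same set of events")
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def sqrt_feature_map(p):
    """Element-wise square root of a probability vector (no training data needed)."""
    return [math.sqrt(pi) for pi in p]


# Toy distributions over three discrete events (illustrative values only).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

bc = bhattacharyya_coefficient(p, q)

# The same value, obtained as a linear inner product after the square-root map.
dot = sum(a * b for a, b in zip(sqrt_feature_map(p), sqrt_feature_map(q)))
```

Because the coefficient is a plain inner product of square-rooted probabilities, it can in principle be used with a linear SVM once the square root has been applied to each input vector.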
Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
Manuscript received December 11, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Nestor Becerra Yoma.
Kong Aik Lee, Chang Huai You, and Haizhou Li are with the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore (e-mail: kalee@i2r.a-star.edu.sg; echyou@i2r.astar.edu.sg; hli@i2r.a-star.edu.sg). The work of Haizhou Li was partially supported by Nokia Foundation.
Tomi Kinnunen is with the School of Computing, University of Eastern Finland, Finland (e-mail: tkinnu@cs.joensuu.fi). The work of T. Kinnunen was supported by the Academy of Finland (project no. 132129, "Characterizing individual information in speech").
Khe Chai Sim is with the School of Computing, National University of Singapore, Singapore (e-mail: simkc@comp.nus.edu.sg).

I. INTRODUCTION
SPEAKER verification is the task of verifying the identity of a person using his/her voice [1]. The verification process typically consists of extracting a sequence of short-term spectral vectors from the given speech signal, matching the sequence of vectors against the claimed speaker's model, and finally comparing the matched score against a verification threshold. Recent advances reported in [1][2][3][4][5][6][7][8] show an emerging trend in using support vector machines (SVMs) for speaker modeling. One reason for the popularity of SVM is its good generalization performance.

The key issue in using SVM for classifying speech signals, which have a va...