Speaker Verification Using Support Vector Machines and High-Level Features

Campbell, William M.; Campbell, Joseph P.; Gleason, Terry P.; Reynolds, Douglas A.; Shen, Wade

doi:10.1109/tasl.2007.902874

Cited by 63 publications

(32 citation statements)

References 29 publications


“…The so-called higher level features provide information complementary to classic spectral features and they make the system more robust [3,6,11,18,20]. In this work four types of features were used: spectral, prosodic, articulatory, and lexical.…”

Section: Multi-level Speaker Recognition (mentioning)

confidence: 99%

Speaker recognition based on multilevel speech signal analysis on Polish corpus

Drgas

Dąbrowski

2013

Multimed Tools Appl


This article deals with a new approach to the text-independent speaker verification task. It is namely proposed to combine spectral and the so-called high-level features (prosodic, articulatory, and lexical) in order to increase accuracy of speaker verification. The presented experiments were performed using a Polish language corpus developed by the authors, the so-called PUEPS corpus. It contains semi-spontaneous telephone conversations (acted emergency telephone notifications) recorded in laboratory conditions. As the Polish language is under-resourced and the PUEPS corpus is relatively small, in this case a new approach is needed, other than these well known from NIST (National Institute of Standards and Technology) evaluations. The authors proposed to use the fast scoring instead of more complex classifiers and the AdaBoost (adaptive boosting) algorithm for features combination. Combination of features resulted in the equal error rate (EER) reduction for various SNR (signal-to-noise ratio) conditions. Additionally, score normalization methods were evaluated. It was shown that significant benefits can be obtained using the z-norm2 method.


“…Term frequency log-likelihood ratio (TFLLR) was introduced in [10] for the scaling of n-gram probabilities. Since each n-gram can be regarded as a discrete event e_i, the n-gram probabilities can be expressed as PMF and supervector as given in (11).…”

Section: A Term Frequency Log-likelihood Ratio (TFLLR) (mentioning)

confidence: 99%
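
The TFLLR scaling mentioned in this statement can be illustrated with a small sketch. It assumes the commonly described form of the weighting, in which each n-gram probability is divided by the square root of that n-gram's probability in a background population, so that a plain inner product of the scaled vectors acts as the kernel. The function names, the n-gram labels, and the toy probabilities below are hypothetical and chosen only for illustration; they are not taken from the cited papers.

```python
import math


def tfllr_scale(probs, background):
    """Divide each n-gram probability by sqrt(background probability).

    N-grams with no background mass are skipped (an assumption of this
    sketch, made to avoid division by zero).
    """
    return {
        gram: p / math.sqrt(background[gram])
        for gram, p in probs.items()
        if background.get(gram, 0.0) > 0.0
    }


def linear_kernel(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(u[g] * v[g] for g in u.keys() & v.keys())


# Hypothetical toy data: background and per-speaker n-gram probabilities.
background = {"a_b": 0.5, "b_c": 0.25, "c_a": 0.25}
spk_a = {"a_b": 0.5, "b_c": 0.5}
spk_b = {"a_b": 0.25, "b_c": 0.25, "c_a": 0.5}

k = linear_kernel(tfllr_scale(spk_a, background), tfllr_scale(spk_b, background))
print(round(k, 6))  # 0.5*0.25/0.5 + 0.5*0.25/0.25 = 0.75
```

Because the scaling depends on the background probabilities, it is data-driven: rare n-grams are weighted up relative to frequent ones.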

“…Notably, high-level feature extraction (e.g., idiolect, phonotactic, prosody) usually produces discrete symbols. For instance, in [10] speech signals are converted into sequences of phone symbols and then represented in terms of phone n-gram probabilities. Discrete probabilities are also useful in modeling prosodic feature sequences [11].…”

Section: Introduction (mentioning)

confidence: 99%
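
As a rough illustration of the representation described in this statement, the sketch below turns a sequence of phone symbols into n-gram relative frequencies. The phone labels and the helper name `ngram_probabilities` are hypothetical; a real system would obtain the symbol sequence from a phone recognizer, which is not shown here.

```python
from collections import Counter


def ngram_probabilities(symbols, n=2):
    """Return the relative frequency of each n-gram in a symbol sequence."""
    grams = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}  # sequence shorter than n
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical phone sequence standing in for recognizer output.
phones = ["s", "p", "iy", "s", "p", "iy", "ch"]
probs = ngram_probabilities(phones, n=2)

for gram, p in sorted(probs.items()):
    print(gram, round(p, 3))
```

The resulting dictionary is a discrete probability distribution over the observed bigrams, which is the kind of vector the kernels discussed in these statements operate on.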

“…The Bhattacharyya measure is symmetric as opposed to other probabilistic measures such as Kullback-Leibler (KL) divergence [14], which is non-symmetric and has to be simplified and approximated substantially to arrive at a symmetric kernel. While the Bhattacharyya measure is simpler, data-independent and more effective, we will also show how it is related to and different from the Fisher kernel [2], term frequency log-likelihood ratio (TFLLR) [10], and rank normalization [15] proposed earlier for similar form of supervectors. The remainder of this paper is organized as follows. We introduce the MAP framework for the estimation of discrete probabilities in Section II.…”

(mentioning)

confidence: 99%
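
The symmetry contrast drawn in this statement can be checked numerically. The sketch below uses the standard definitions of the Bhattacharyya coefficient, the sum over events of sqrt(p_i * q_i), and of the KL divergence; the two toy distributions are hypothetical. It is only an illustration of the definitions, not the kernel construction used in the cited paper.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Sum of sqrt(p_i * q_i) over the events shared by p and q."""
    return sum(math.sqrt(p[e] * q[e]) for e in p.keys() & q.keys())


def kl_divergence(p, q):
    """KL(p || q); assumes q[e] > 0 wherever p[e] > 0."""
    return sum(pe * math.log(pe / q[e]) for e, pe in p.items() if pe > 0.0)


# Hypothetical discrete distributions over two events.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

bc_pq = bhattacharyya_coefficient(p, q)
bc_qp = bhattacharyya_coefficient(q, p)
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(bc_pq, bc_qp)  # identical: the coefficient is symmetric
print(kl_pq, kl_qp)  # different: KL divergence is not symmetric
```

The coefficient equals the inner product of the element-wise square roots of the two distributions, so the square-root mapping needs no statistics from a background data set.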

“…The discrete events may also correspond to abstract linguistic units such as phonemes, syllables, words, or subsequences of n symbols (i.e., n-grams). For instance, in spoken language recognition [16] and speaker recognition utilizing high-level features [10,11], the events represent n-grams of phones, words or some prosodic features. In these methods, phone recognizers or prosodic feature extractors are used to discover the events set from the speech signals.…”

mentioning

confidence: 99%


Using Discrete Probabilities With Bhattacharyya Measure for SVM-Based Speaker Verification

Lee

You

et al. 2011

IEEE Trans. Audio Speech Lang. Process.


Abstract: Support vector machines (SVMs), and kernel classifiers in general, rely on the kernel functions to measure the pairwise similarity between inputs. This paper advocates the use of discrete representation of speech signals in terms of the probabilities of discrete events as feature for speaker verification and proposes the use of Bhattacharyya coefficient as the similarity measure for this type of inputs to SVM. We analyze the effectiveness of the Bhattacharyya measure from the perspective of feature normalization and distribution warping in the SVM feature space. Experiments conducted on the NIST 2006 speaker verification task indicate that the Bhattacharyya measure outperforms the Fisher kernel, term frequency log-likelihood ratio (TFLLR) scaling, and rank normalization reported earlier in literature. Moreover, the Bhattacharyya measure is computed using a data-independent square-root operation instead of data-driven normalization, which simplifies the implementation. The effectiveness of the Bhattacharyya measure becomes more apparent when channel compensation is applied at the model and score levels. The performance of the proposed method is close to that of the popular GMM supervector with a small margin.

Index Terms: Bhattacharyya coefficient, speaker verification, support vector machine, supervector.

I. INTRODUCTION

SPEAKER verification is the task of verifying the identity of a person using his/her voice [1]. The verification process typically consists of extracting a sequence of short-term spectral vectors from the given speech signal, matching the sequence of vectors against the claimed speaker's model, and finally comparing the matched score against a verification threshold. Recent advances reported in [1]-[8] show an emerging trend in using support vector machines (SVMs) for speaker modeling. One reason for the popularity of SVM is its good generalization performance. The key issue in using SVM for classifying speech signals, which have a va...

Manuscript received December 11, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Nestor Becerra Yoma.

Kong Aik Lee, Chang Huai You, and Haizhou Li are with the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore (e-mail: kalee@i2r.a-star.edu.sg; echyou@i2r.astar.edu.sg; hli@i2r.a-star.edu.sg). The work of Haizhou Li was partially supported by Nokia Foundation.

Tomi Kinnunen is with the School of Computing, University of Eastern Finland, Finland (e-mail: tkinnu@cs.joensuu.fi). The work of T. Kinnunen was supported by the Academy of Finland (project no. 132129, "Characterizing individual information in speech").

Khe Chai Sim is with the School of Computing, National University of Singapore, Singapore (e-mail: simkc@comp.nus.edu.sg).

Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
