Abstract. Fusion techniques have been widely used in multi-modal biometric authentication systems. While these techniques are mainly applied to combine the outputs of modality-dependent classifiers, they can also be used to fuse the decisions or scores from a single modality. The idea is to treat the multiple samples extracted from a single modality as independent but coming from the same source. In this chapter, we propose a single-source, multi-sample data-dependent fusion algorithm for speaker verification. The algorithm is data-dependent in that the fusion weights depend on the verification scores and the prior score statistics of claimed speakers and background speakers. To make the best use of the speaker's scores, scores from multiple utterances are sorted before they are probabilistically combined. Evaluations based on 150 speakers from a GSM-transcoded corpus are presented. Results show that data-dependent fusion of speakers' scores is significantly better than the conventional score-averaging approach, and that the proposed fusion algorithm can be further enhanced by sorting the score sequences before they are probabilistically combined.
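To make the contrast with score averaging concrete, the Python sketch below outlines the two decision rules: a plain mean of the utterance scores versus a sorted, weighted combination. The helper `weight_fn` is a placeholder assumption standing in for the data-dependent weights described above, not the chapter's actual formula.

```python
import numpy as np

def average_fusion(scores):
    """Conventional baseline: every utterance score gets an equal weight."""
    return float(np.mean(scores))

def sorted_weighted_fusion(scores, weight_fn):
    """Sort the utterance scores first, then combine them with
    data-dependent weights supplied by weight_fn."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # highest score first
    w = np.asarray(weight_fn(s), dtype=float)
    w = w / w.sum()                                      # normalize the weights
    return float(np.dot(w, s))

def accept(fused_score, threshold):
    """Accept the claimant if the fused score exceeds the decision threshold."""
    return fused_score > threshold

# Example: three utterance scores (e.g. log-likelihood ratios) from one claimant.
scores = [-0.3, 1.2, 0.7]
weight_fn = lambda s: np.exp(s)   # placeholder: larger scores receive larger weights
print(average_fusion(scores), sorted_weighted_fusion(scores, weight_fn))
```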
This paper proposes a single-source, multi-sample fusion approach to text-independent speaker verification. In conventional speaker verification systems, the scores obtained from a claimant's utterances are averaged and the resulting mean score is used for decision making. Instead of using an equal weight for all scores, this paper proposes assigning a different weight to each score, where the weights depend on the difference between the score values and a speaker-dependent reference score obtained during enrollment. Because the fusion weights depend on the verification scores, a technique called constrained stochastic feature transformation is applied to minimize the mismatch between enrollment and verification data and thereby enhance the reliability of the scores. Experimental results based on the 2001 NIST evaluation set show that the proposed fusion approach outperforms the equal-weight approach by 22% in terms of equal error rate and 16% in terms of minimum detection cost.
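For context, the utterance scores entering the fusion stage are typically GMM-UBM log-likelihood ratios. The sketch below assumes scikit-learn's `GaussianMixture` as a stand-in for the speaker and background models and shows how such a score could be computed for a single utterance; it illustrates the scoring convention only and is not the system used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=64):
    """Fit a diagonal-covariance GMM to a matrix of acoustic feature
    vectors (one row per frame, e.g. MFCCs)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(features)
    return gmm

def llr_score(utterance, speaker_gmm, background_gmm):
    """Frame-averaged log-likelihood ratio of the claimed speaker model
    against the background model; this is the per-utterance score that
    is later weighted and fused."""
    return float(np.mean(speaker_gmm.score_samples(utterance)
                         - background_gmm.score_samples(utterance)))
```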
In speaker verification, a claimant may produce two or more utterances. Typically, the scores of the speech patterns extracted from these utterances are averaged and the resulting mean score is compared with a decision threshold. Rather than simply computing the mean score, we propose to compute the optimal weights for fusing the scores based on the score distribution of the independent utterances and our prior knowledge about the score statistics. More specifically, we use enrollment data to compute the mean scores of client speakers and impostors and consider them to be the prior scores. During verification, we set the fusion weights for individual speech patterns to be a function of the dispersion between the scores of these speech patterns and the prior scores. Experimental results based on the GSM-transcoded speech of 150 speakers from the HTIMIT corpus demonstrate that the proposed fusion algorithm can increase the dispersion between the mean speaker scores and the mean impostor scores. Compared with a baseline approach where equal weights are assigned to all scores, the proposed approach provides a relative error reduction of 19%.
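One way to read this weighting scheme is sketched below: the prior scores are the mean client and impostor scores estimated from enrollment data, and each utterance's weight grows with how much closer its score lies to the client prior than to the impostor prior. The softmax form and the scale parameter `beta` are illustrative assumptions, not the weight function derived in the paper.

```python
import numpy as np

def prior_scores(client_enroll_scores, impostor_enroll_scores):
    """Prior score statistics (mean client score, mean impostor score)
    estimated from enrollment data."""
    return float(np.mean(client_enroll_scores)), float(np.mean(impostor_enroll_scores))

def dispersion_weights(scores, client_prior, impostor_prior, beta=1.0):
    """Data-dependent fusion weights: scores lying closer to the client
    prior than to the impostor prior receive larger weights."""
    s = np.asarray(scores, dtype=float)
    dispersion = np.abs(s - impostor_prior) - np.abs(s - client_prior)
    w = np.exp(beta * dispersion)
    return w / w.sum()

def fuse(scores, client_prior, impostor_prior, beta=1.0):
    """Weighted combination of the utterance scores."""
    w = dispersion_weights(scores, client_prior, impostor_prior, beta)
    return float(np.dot(w, np.asarray(scores, dtype=float)))
```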
This paper proposes a constrained stochastic feature transformation algorithm for robust speaker verification. The algorithm computes the feature transformation parameters based on the statistical difference between a test utterance and a composite GMM formed by combining the speaker and background models. The transformation is then used to transform the test utterance to fit the clean speaker model and background model before verification. Because the transformation is implicitly constrained, the transformed features can fit both models simultaneously. Experimental results based on the 2001 NIST evaluation set show that the proposed algorithm achieves significant improvements in both equal error rate and minimum detection cost when compared with cepstral mean subtraction and Z-norm. The performance of the proposed transformation approach is also slightly better than that of the short-time Gaussianization method proposed in [1].
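The sketch below conveys the general shape of such a feature transformation: a per-dimension affine map y = a * x + b whose parameters are chosen to maximize the likelihood of the transformed test features under the composite GMM, and which is then applied before scoring against the clean models. The diagonal parameterization and the use of a generic optimizer are simplifying assumptions; the constraint and the estimation procedure of the actual algorithm are not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.mixture import GaussianMixture

def fit_feature_transform(test_features, composite_gmm):
    """Estimate a per-dimension affine transform y = a * x + b that maximizes
    the likelihood of the transformed test features under the composite GMM
    (formed by combining the speaker and background models)."""
    dim = test_features.shape[1]

    def neg_log_likelihood(params):
        a, b = params[:dim], params[dim:]
        return -composite_gmm.score(test_features * a + b)  # mean log-likelihood

    init = np.concatenate([np.ones(dim), np.zeros(dim)])    # start at the identity
    result = minimize(neg_log_likelihood, init, method="L-BFGS-B")
    return result.x[:dim], result.x[dim:]

def apply_transform(features, a, b):
    """Transform the test features before scoring against the clean models."""
    return features * a + b
```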