Current speaker recognition systems, trained on large datasets with sophisticated modeling, turn out to be highly domain-specific, sometimes yielding disappointing results in real-life applications. Any mismatch between training and test data, in terms of device, language, duration, noise or other conditions, tends to degrade speaker detection accuracy. This study investigates unsupervised domain adaptation, where only a scarce and unlabeled "in-domain" development dataset is available. The details and relevance of different approaches are described and discussed, leading to a new robust method that we call the feature-Distribution Adaptor. The efficiency of the proposed technique is experimentally validated on the recent NIST 2016 and 2018 Speaker Recognition Evaluation datasets.
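The abstract does not detail the feature-Distribution Adaptor itself. As a generic illustration of the unsupervised setting it describes (aligning features using only unlabeled in-domain data), the sketch below matches second-order statistics in the style of CORAL; this is a stand-in technique, not the paper's method, and the function name and data layout are hypothetical.

```python
import numpy as np

def coral_adapt(source, target, eps=1e-6):
    """Map source-domain features onto the target (in-domain) feature
    distribution by matching means and covariances (CORAL-style).
    Only unlabeled target samples are needed, matching the
    unsupervised domain adaptation setting. Arrays are (n, dim)."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    # Regularized covariance estimates (eps keeps them invertible).
    cs = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    ct = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])
    ls, lt = np.linalg.cholesky(cs), np.linalg.cholesky(ct)
    # Whiten the centered source features, then re-color them with the
    # target covariance and shift them to the target mean.
    z = (source - mu_s) @ np.linalg.inv(ls).T @ lt.T
    return z + mu_t
```

After adaptation, the transformed source features have (up to the regularizer) the same mean and covariance as the in-domain development data.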
Short speech duration remains a critical factor of performance degradation when deploying a speaker verification system. To overcome this difficulty, a large number of commercial applications impose the use of fixed pass-phrases. In this context, we show that the performance of the popular i-vector approach can be greatly improved by taking advantage of the phonetic information that these pass-phrases convey. Moreover, as i-vectors require a conditioning process to reach high accuracy, we show that further improvements are possible by exploiting this phonetic information within the normalisation process. We compare two methods, Within Class Covariance Normalization (WCCN) and Eigen Factor Radial (EFR), both relying on parameters estimated on the same development data. Our study suggests that WCCN is more robust to data mismatch but less efficient than EFR when the development data better matches the test data.
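The two normalizations compared above can be sketched from their standard definitions in the i-vector literature: WCCN whitens by the average within-speaker covariance estimated on development data, while EFR iteratively standardizes by the total covariance and length-normalizes. The function names, the regularizer, and the two-iteration default below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def wccn_projection(ivectors, labels, eps=1e-6):
    """WCCN: estimate the average within-speaker covariance W on the
    development set and return B = chol(W^-1); normalized vectors are
    ivectors @ B, whose within-class covariance is the identity."""
    dim = ivectors.shape[1]
    w = np.zeros((dim, dim))
    classes = np.unique(labels)
    for c in classes:
        xc = ivectors[labels == c]
        xc = xc - xc.mean(axis=0)
        w += xc.T @ xc / len(xc)
    w = w / len(classes) + eps * np.eye(dim)
    return np.linalg.cholesky(np.linalg.inv(w))

def efr_normalize(ivectors, n_iter=2, eps=1e-6):
    """EFR: repeatedly center, whiten by the total covariance, and
    project onto the unit sphere (length normalization)."""
    x = ivectors.astype(float).copy()
    for _ in range(n_iter):
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + eps * np.eye(x.shape[1])
        inv_sqrt = np.linalg.inv(np.linalg.cholesky(cov)).T
        x = (x - mu) @ inv_sqrt
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x
```

Both transforms are estimated on the same development data, which is why their robustness to development/test mismatch can be compared directly.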
This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Speaker Embeddings, for speaker diarization. Speaker Embedding features are taken from the hidden-layer neuron activations of Deep Neural Networks (DNN) trained as classifiers to recognize a thousand speaker identities in a training set. Although learned through identification, speaker embeddings are shown to be effective for speaker verification, in particular for recognizing speakers unseen in the training set. This approach is then applied to speaker diarization. Experiments conducted on the ETAPE corpus of French broadcast news show that this new speaker modeling technique decreases DER by 1.67 points (a relative improvement of about 8%).
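The key idea, reading the embedding from a hidden layer of a speaker-identification network rather than from its output layer, can be illustrated with a toy forward pass. The architecture, layer sizes, and class names below are hypothetical and not the paper's DNN; only the extraction pattern is the point.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class SpeakerNet:
    """Toy feed-forward speaker classifier. The embedding is read from
    the last hidden layer, not from the softmax output, so it remains
    usable (e.g. via cosine scoring) for speakers unseen in training."""

    def __init__(self, dim_in, dim_hidden, n_speakers, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(dim_in, dim_hidden))
        self.w2 = rng.normal(scale=0.1, size=(dim_hidden, dim_hidden))
        self.w_out = rng.normal(scale=0.1, size=(dim_hidden, n_speakers))

    def embed(self, x):
        # Hidden-layer activations used as the speaker embedding.
        return relu(relu(x @ self.w1) @ self.w2)

    def classify(self, x):
        # Softmax over the training-set speaker identities; only used
        # to train the network, then discarded at diarization time.
        logits = self.embed(x) @ self.w_out
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
```

At diarization time, segments would be compared through their embeddings (for instance with cosine distance before clustering), never through the identification outputs.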