Pooyan Safari scite author profile

Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work on a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments and those are averaged to obtain an utterance level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non fixed length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which decides the most discriminative features over the sequence to obtain an utterance level representation. We have tested this approach for the verification task for the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods with a 18% of relative EER. Obtained results show a 58% relative improvement in EER compared to i-vector+PLDA.

show abstract

Self-Attention Encoding and Pooling for Speaker Recognition

Safari

India

Hernando

2020

View full text Add to dashboard Cite

From Features to Speaker Vectors by means of Restricted Boltzmann Machine Adaptation

Safari¹,

Ghahabi²,

Hernando³

2016

View full text Add to dashboard Cite

Restricted Boltzmann Machines (RBMs) have shown success in different stages of speaker recognition systems. In this paper, we propose a novel framework to produce a vector-based representation for each speaker, which will be referred to as RBMvector. This new approach maps the speaker spectral features to a single fixed-dimensional vector carrying speaker-specific information. In this work, a global model, referred to as Universal RBM (URBM), is trained taking advantage of RBM unsupervised learning capabilities. Then, this URBM is adapted to the data of each speaker in the development, enrolment and evaluation datasets. The network connection weights of the adapted RBMs are further concatenated and subject to a whitening with dimension reduction stage to build the speaker vectors. The evaluation is performed on the core test condition of the NIST SRE 2006 database, and it is shown that RBM-vectors achieve 15% relative improvement in terms of EER compared to i-vectors using cosine scoring. The score fusion with i-vector attains more than 24% relative improvement. The interest of this result for score fusion yields on the fact that both vectors are produced in an unsupervised fashion and can be used instead of i-vector/PLDA approach, when no data label is available. Results obtained for RBM-vector/PLDA framework is comparable with the ones from i-vector/PLDA. Their score fusion achieves 14% relative improvement compared to i-vector/PLDA.

show abstract

Self Multi-Head Attention for Speaker Recognition

India¹,

Safari²,

Hernando³

2019

Preprint

View full text Add to dashboard Cite

Feature classification by means of deep belief networks for speaker recognition

Safari

Ghahabi

Hernando

2015

View full text Add to dashboard Cite

In this paper, we propose to discriminatively model target and impostor spectral features using Deep Belief Networks (DBNs) for speaker recognition. In the feature level, the number of impostor samples is considerably large compared to previous works based on i-vectors. Therefore, those i-vector based impostor selection algorithms are not computationally practical. On the other hand, the number of samples for each target speaker is different from one speaker to another which makes the training process more difficult. In this work, we take advantage of DBN unsupervised learning to train a global model, which will be referred to as Universal DBN (UDBN). Then we adapt this UDBN to the data of each target speaker. The evaluation is performed on the core test condition of the NIST SRE 2006 database and it is shown that the proposed architecture achieves more than 8% relative improvement in comparison to the conventional Multilayer Perceptron (MLP).

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Pooyan Safari

Self Multi-Head Attention for Speaker Recognition

Self-Attention Encoding and Pooling for Speaker Recognition

From Features to Speaker Vectors by means of Restricted Boltzmann Machine Adaptation

Self Multi-Head Attention for Speaker Recognition

Feature classification by means of deep belief networks for speaker recognition

Contact Info

Product

Resources

About