Spoken Language Recognition using X-vectors

Snyder, David; Garcia‐Romero, Daniel; McCree, Alan; Sell, Gregory; Povey, Daniel; Khudanpur, Sanjeev

doi:10.21437/odyssey.2018-15

Cited by 176 publications

(159 citation statements)

References 17 publications

Supporting

Mentioning

158

Contrasting

“…It is known that neural network approaches are data-hungry. With data augmentation [20] and larger datasets like VoxCeleb 2 [21], neural network approaches achieve better performance than the i-vector method. Nevertheless, for applications with limited training data, i-vector warrants in-depth investigation.…”

Section: Speaker Verification on VoxCeleb (mentioning)

confidence: 99%

Mixture Factorized Auto-Encoder for Unsupervised Hierarchical Deep Factorization of Speech Signal

Peng, Feng, Lee

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Speech signal is constituted and contributed by various informative factors, such as linguistic content and speaker characteristic. There have been notable recent studies attempting to factorize speech signal into these individual factors without requiring any annotation. These studies typically assume continuous representation for linguistic content, which is not in accordance with general linguistic knowledge and may make the extraction of speaker information less successful. This paper proposes the mixture factorized auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer models linguistic content of input speech with a discrete categorical distribution. It performs frame clustering by assigning each frame a soft mixture label. The utterance embedder generates an utterance-level vector representation. A frame decoder serves to reconstruct speech features from the encoders' outputs. The mFAE is evaluated on speaker verification (SV) task and unsupervised subword modeling (USM) task. The SV experiments on VoxCeleb 1 show that the utterance embedder is capable of extracting speaker-discriminative embeddings with performance comparable to a x-vector baseline. The USM experiments on ZeroSpeech 2017 dataset verify that the frame tokenizer is able to capture linguistic content and the utterance embedder can acquire speaker-related information.

Index Terms: unsupervised deep factorization, mixture factorized auto-encoder, speaker verification, unsupervised subword modeling

“…We use the x-vector system as described in [16], [17]. The raw feature of the system is 40-dimensional filterbanks.…”

Section: A X-vector System (mentioning)

confidence: 99%

AP19-OLR Challenge: Three Tasks and Their Baselines

Tang, Wang, Song

2019

2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

This paper introduces the fourth oriental language recognition (OLR) challenge AP19-OLR, including the data profile, the tasks and the evaluation principles. The OLR challenge has been held successfully for three consecutive years, along with APSIPA Annual Summit and Conference (APSIPA ASC). The challenge this year still focuses on practical and challenging tasks, precisely (1) short-utterance LID, (2) cross-channel LID and (3) zero-resource LID. The event this year includes more languages and more real-life data provided by SpeechOcean and the NSFC M2ASR project. All the data is free for participants. Recipes for x-vector system and back-end evaluation are also conducted as baselines for the three tasks. The participants can refer to these online-published recipes to deploy LID systems for convenience. We report the baseline results on the three tasks and demonstrate that the three tasks are worth some efforts to achieve better performance.

“…In summary, the PLDA model (whether Gaussian or heavy-tailed), provides the functional form (13), (14) and (15), for extracting Gaussian meta-embeddings from i-vectors. We shall explore both generative and discriminative methods for training the parameters of this GME extractor.…”

Section: GME Extractor and Scoring (mentioning)

confidence: 99%

“…The extractor parameters, W and F, are updated by backpropagating gradients through the BXE objective, through the scoring formula (9) and the extractor formula (13) and (14). The value of ν remains fixed at the plugged in value throughout training.…”

Section: Discriminative GME Extractor Training (mentioning)

confidence: 99%

Gaussian meta-embeddings for efficient scoring of a heavy-tailed PLDA model

Brümmer, Silnova, Burget, et al.

2018

The Speaker and Language Recognition Workshop (Odyssey 2018)

Embeddings in machine learning are low-dimensional representations of complex input patterns, with the property that simple geometric operations like Euclidean distances and dot products can be used for classification and comparison tasks. We introduce meta-embeddings, which live in more general inner product spaces and which are designed to better propagate uncertainty through the embedding bottleneck. Traditional embeddings are trained to maximize between-class and minimize within-class distances. Meta-embeddings are trained to maximize relevant information throughput. As a proof of concept in speaker recognition, we derive an extractor from the familiar generative Gaussian PLDA model (GPLDA). We show that GPLDA likelihood ratio scores are given by Hilbert space inner products between Gaussian likelihood functions, which we term Gaussian meta-embeddings (GMEs). Meta-embedding extractors can be generatively or discriminatively trained. GMEs extracted by GPLDA have fixed precisions and do not propagate uncertainty. We show that a generalization to heavy-tailed PLDA gives GMEs with variable precisions, which do propagate uncertainty. Experiments on NIST SRE 2010 and 2016 show that the proposed method applied to i-vectors without length normalization is up to 20% more accurate than GPLDA applied to length-normalized i-vectors.
