Improving i-Vector and PLDA Based Speaker Clustering with Long-Term Features

Woubie, Abraham; Luque, Jordi; Hernando, Javier

doi:10.21437/interspeech.2016-339

Cited by 7 publications

(6 citation statements)

References 14 publications

“…In our work, we have proposed the extraction of i-vectors from the short-term cepstral and long-term speech features and the fusion of their cosine-distance and PLDA scores. These results have already been published in our previous works of [21,22].…”

Section: Introduction (supporting)

confidence: 85%

The use of long-term features for GMM- and i-vector-based speaker diarization systems

Zewoudie, Luque, Hernando. 2018. J AUDIO SPEECH MUSIC PROC. (Self cite)

Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems. The other factors include the techniques employed to perform both segmentation and clustering. While the static mel frequency cepstral coefficients are the most widely used features in speech-related tasks including speaker diarization, several studies have shown the benefits of augmenting regular speech features with the static ones. In this work, we have proposed and assessed the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with the state-of-the-art short-term cepstral and long-term prosodic features. Additionally, the use of delta dynamic features is explored separately for both the segmentation and bottom-up clustering sub-tasks. The combination of the different feature sets is carried out at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are independently modeled and fused at the score likelihood level. Various feature combinations have been applied both for Gaussian mixture modeling and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction meeting corpus. The best result, in terms of diarization error rate, is reported by using i-vector-based cosine-distance clustering together with a signal parameterization consisting of a combination of static cepstral coefficients, delta, voice-quality, and prosodic features. The best result shows about 24% relative diarization error rate improvement compared to the baseline system which is based on Gaussian mixture modeling and short-term static cepstral coefficients.
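The score-level fusion mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a simple linear weighted sum of two already-computed scores; the function name `fuse_scores` and the weight `alpha` (including its default value) are hypothetical and are not taken from the paper.

```python
def fuse_scores(short_term_score: float,
                long_term_score: float,
                alpha: float = 0.9) -> float:
    """Linearly fuse two scores for the same pair of segments or clusters.

    alpha weights the short-term (cepstral) stream and (1 - alpha) the
    long-term stream. The default of 0.9 is an illustrative assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * short_term_score + (1.0 - alpha) * long_term_score
```

In practice a weight of this kind would be tuned on development data rather than fixed in advance.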

“…In all of our previous works [9,10,21,22], only the static MFCCs were used. The deltas were not used in these works.…”

Section: Introduction (mentioning)

confidence: 99%

“…If the score is greater than a pre-defined threshold τ, x is accepted as a reference speaker's utterance; otherwise, it is rejected. The input observation x can be a raw speech waveform itself or an encoded vector using various feature extraction algorithms for speaker verification such as Mel-frequency cepstral coefficients (MFCCs) [25], i-vector [26][27][28][29], or speaker embedding vectors [5,7,8,15]. In this paper, we model the raw speech .., xe M }, and we define the score function f (·, ·) based on the cosine similarity: Fig.…”

Section: Speaker Verification (mentioning)

confidence: 99%
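The accept/reject rule quoted above (a cosine-similarity score compared against a threshold τ) can be sketched as follows. This is a minimal illustration, not the citing paper's implementation: the function names `cosine_similarity` and `verify` are hypothetical, and the vectors stand in for whatever embeddings the system produces.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def verify(x: Sequence[float], reference: Sequence[float], tau: float) -> bool:
    """Accept x as the reference speaker only if the score exceeds tau."""
    return cosine_similarity(x, reference) > tau
```

For example, `verify([1.0, 0.0], [0.0, 1.0], tau=0.5)` rejects the trial because orthogonal vectors score 0.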

An End-to-End Text-Independent Speaker Verification Framework with a Keyword Adversarial Network

Yun, Cho, Eum et al. 2019. Interspeech 2019

This paper presents an end-to-end text-independent speaker verification framework by jointly considering the speaker embedding (SE) network and automatic speech recognition (ASR) network. The SE network learns to output an embedding vector which distinguishes the speaker characteristics of the input utterance, while the ASR network learns to recognize the phonetic context of the input. In training our speaker verification framework, we consider both the triplet loss minimization and adversarial gradient of the ASR network to obtain more discriminative and text-independent speaker embedding vectors. With the triplet loss, the distances between the embedding vectors of the same speaker are minimized while those of different speakers are maximized. Also, with the adversarial gradient of the ASR network, the text-dependency of the speaker embedding vector can be reduced. In the experiments, we evaluated our speaker verification framework using the LibriSpeech and CHiME 2013 datasets, and the evaluation results show that our speaker verification framework achieves a lower equal error rate and better text-independency compared to the other approaches.
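The triplet-loss idea described in the abstract above (pull same-speaker embeddings together, push different-speaker embeddings apart) can be sketched with the common hinge formulation. The abstract does not give the exact loss, so the Euclidean distance, the `margin` value, and the function names below are assumptions for illustration only.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 0.2) -> float:
    """Hinge-style triplet loss for one (anchor, positive, negative) triple.

    The loss is zero once the negative (different speaker) is farther from
    the anchor than the positive (same speaker) by at least `margin`.
    """
    return max(0.0,
               euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

A triple whose negative already lies well beyond the positive contributes no loss, so training effort concentrates on triples that still violate the margin.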

“…The clustering assigns a label set Y = {y1, ..., yN } to X, and yi ∈ {1, ..., K}. Each observation xi of dimension D can be a speech utterance itself or an encoded vector using various feature extraction algorithms for speaker clustering such as Mel frequency cepstral coefficients (MFCCs) [18], glottal to noise excitation ratio (GNE) [8], i-vector [11,14,10,8] and MBN [7]. In this paper, we use the i-vector for the feature vector of an observation, and it can be obtained by:…”

Section: Speaker Clustering (mentioning)

confidence: 99%

“…In addition, the speaker's voice identification and verification [4,5,6] are becoming attractive features for user-specific services. To provide such services, speaker clustering [7,8] plays a key role in identifying the number of speakers and grouping the utterances from the same user for the automatic user-specific model generation or speaker diarization [9,10].…”

Section: Introduction (mentioning)

confidence: 99%

Speaker Clustering by Iteratively Finding Discriminative Feature Space and Cluster Labels

2017

This paper presents a speaker clustering framework that iteratively performs two stages: a discriminative feature space is obtained given a cluster label set, and the cluster label set is updated using a clustering algorithm given the feature space. During these iterations, the cluster labels may differ from the true labels, and thus the feature space obtained from the labels may be inaccurately discriminated. However, by iteratively performing the above two stages, more accurate cluster labels and a more discriminative feature space can be obtained, and they finally converge. In this research, linear discriminant analysis is used for discriminating the i-vector feature space, and variational Bayesian expectation-maximization on a Gaussian mixture model is used for clustering the i-vectors. Our iterative clustering framework was evaluated using a database of keyword utterances and compared with recently published approaches. In all experiments, the results show that our framework outperforms the other approaches and converges in a few iterations.
