2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003860
Joint Optimization of Classification and Clustering for Deep Speaker Embedding

Cited by 5 publications (3 citation statements) · References 22 publications
“…Recent deep learning based speaker verification approaches can be categorized into two main directions: advanced network structure design [1,2,3,4,18] and effective loss function design [6,19,20,21].…”
Section: Related Work
Confidence: 99%
“…Various loss functions have been studied for speaker verification. Wang et al. [20] jointly optimize classification and clustering with a large margin softmax loss and a large margin Gaussian mixture loss. The logistic affinity loss [19] instead optimizes an end-to-end speaker verification model by learning a decision boundary that separates similar pairs from dissimilar pairs.…”
Section: Related Work
Confidence: 99%
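As a concrete illustration of the loss-design direction discussed in this excerpt, below is a minimal sketch of an additive-margin softmax, one common member of the large-margin softmax family. The scale `s`, margin `m`, and class name are illustrative assumptions, not the exact formulation used in [20].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax: a common large-margin softmax variant
    (illustrative sketch; the s and m defaults are assumptions)."""

    def __init__(self, embed_dim, num_speakers, s=30.0, m=0.2):
        super().__init__()
        # One weight vector per speaker, compared to embeddings by cosine.
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine similarities between L2-normalized embeddings and weights.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin m from the target-class cosine only, forcing
        # the target similarity to exceed the others by at least m.
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (cos - self.m * one_hot)
        return F.cross_entropy(logits, labels)
```

In use, such a loss replaces the network's final linear-plus-softmax layer during training, while verification scores embeddings directly by cosine similarity.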
“…In the first, the network is trained as a multi-class classifier over a large number of classes (speakers in our case). These networks use objectives that augment traditional classification losses with terms intended to encourage tighter within-class clustering of embeddings [3,24,25,26,27,28,29] along with increased separation between embeddings of instances from different classes. The expectation is that this behavior will generalize to data outside the training set.…”
Section: Introduction
Confidence: 99%
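A hedged sketch of one such augmented objective: standard cross-entropy paired with a center-loss style term that pulls each embedding toward its class center, encouraging within-class clustering. The weighting `lambda_c` and the class name are illustrative assumptions, not the objective of any specific cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationWithClusteringLoss(nn.Module):
    """Cross-entropy plus a center-loss style clustering term
    (illustrative sketch; lambda_c is an assumed weighting)."""

    def __init__(self, embed_dim, num_classes, lambda_c=0.01):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)
        # One learnable center per class, trained jointly with the network.
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.lambda_c = lambda_c

    def forward(self, embeddings, labels):
        # Standard multi-class classification loss over speakers.
        ce = F.cross_entropy(self.classifier(embeddings), labels)
        # Penalize squared distance from each embedding to its class center,
        # encouraging within-class compactness of the embedding space.
        cluster = (embeddings - self.centers[labels]).pow(2).sum(dim=1).mean()
        return ce + self.lambda_c * cluster
```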