Speaker embedding is an important front-end module for extracting discriminative speaker features (e.g., the X-vector) in many speech applications that require speaker information. Current state-of-the-art backbone networks for speaker embedding aggregate multi-scale features from an utterance using multi-branch network architectures (e.g., ECAPA-TDNN). However, naively adding branches of multi-scale features with simple fully convolutional operations cannot efficiently improve performance, because the number of model parameters and the computational complexity grow rapidly. Consequently, the most recent state-of-the-art architectures include only a few branches, covering a limited number of temporal scales. To address this problem, we propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be added to a speaker embedding network with almost no increase in computational cost. The model is based on the conventional time-delay neural network (TDNN), whose architecture is separated into two operators: a channel-modeling operator and a temporal multi-branch modeling operator. Adding temporal scales to the multi-branch operator requires only a small number of additional parameters, which frees more of the computational budget for branches with large temporal scales. Moreover, after training, we apply a systematic reparameterization method in the inference stage that converts the multi-branch network topology into a single-path topology to increase inference speed. We evaluated the new TMS method for automatic speaker verification (ASV) under in-domain (VoxCeleb) and out-of-domain (CNCeleb) conditions. Results show that the TMS-based model significantly outperforms state-of-the-art ASV models such as ECAPA-TDNN while also generalizing better. Moreover, the proposed model achieved a 29%–46% speed-up in inference over the state-of-the-art ECAPA-TDNN.
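
The following is a minimal PyTorch sketch, not the authors' released code, of the two ideas summarized above: a TDNN layer split into a channel-modeling operator (a pointwise convolution) and a temporal multi-branch operator (parallel depthwise convolutions with different kernel sizes), plus a RepVGG-style structural reparameterization that folds the parallel branches into a single depthwise convolution for inference. All class and method names here (`TMSLayer`, `reparameterize`) are illustrative assumptions, as are the specific kernel sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TMSLayer(nn.Module):
    """Illustrative temporal multi-scale layer: channel mixing + multi-branch temporal convs."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Channel-modeling operator: a 1x1 (pointwise) convolution.
        self.channel_op = nn.Conv1d(channels, channels, kernel_size=1)
        # Temporal multi-branch operator: one depthwise conv per temporal scale.
        # Each depthwise branch adds only channels * k parameters, so extra
        # branches with large kernels are cheap.
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.channel_op(x)
        # Sum the multi-scale temporal branches plus an identity path.
        return x + sum(branch(x) for branch in self.branches)

    @torch.no_grad()
    def reparameterize(self) -> nn.Conv1d:
        """Fold all temporal branches (and the identity) into one depthwise conv."""
        k_max = max(conv.kernel_size[0] for conv in self.branches)
        channels = self.channel_op.out_channels
        fused = nn.Conv1d(channels, channels, k_max,
                          padding=k_max // 2, groups=channels)
        fused.weight.zero_()
        fused.bias.zero_()
        # The identity path is a depthwise conv with a single centered unit tap.
        fused.weight[:, 0, k_max // 2] = 1.0
        for conv in self.branches:
            k = conv.kernel_size[0]
            pad = (k_max - k) // 2
            # Zero-pad smaller kernels to k_max, then add weights and biases;
            # convolution is linear, so the fused conv matches the branch sum.
            fused.weight += F.pad(conv.weight, (pad, pad))
            fused.bias += conv.bias
        return fused
```

Because convolution is linear, applying `channel_op` followed by the fused depthwise conv reproduces the multi-branch output exactly while traversing a single path, which is the mechanism behind the reported inference speed-up; the abstract's two-operator decomposition likewise explains why extra temporal scales cost few parameters compared with full convolutions.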