In this paper, we propose a new pooling method called spatial pyramid encoding (SPE) to generate speaker embeddings for text-independent speaker verification. We first partition the output feature maps of a deep residual network (ResNet) into increasingly fine sub-regions and extract a speaker embedding from each sub-region through a learnable dictionary encoding layer. These embeddings are concatenated to obtain the final speaker representation. The SPE layer not only produces a fixed-dimensional speaker embedding for a variable-length speech segment, but also aggregates feature-distribution information over multi-level temporal bins. Furthermore, we apply deep length normalization by augmenting the loss function with ring loss. With ring loss, the network gradually learns to normalize the speaker embeddings using its own weights while preserving convexity, leading to more robust speaker embeddings. Experiments on the VoxCeleb1 dataset show that the proposed system using the SPE layer and ring loss-based deep length normalization outperforms both i-vector and d-vector baselines.

Index Terms: speaker verification, spatial pyramid encoding, learnable dictionary encoding, ring loss, length normalization
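To make the method concrete, the following is a minimal PyTorch sketch of the two components described above. The class names, the pyramid levels (1 and 2 temporal bins), and the codebook size are illustrative assumptions, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDE(nn.Module):
    """Learnable dictionary encoding: softly assigns frame features to
    learnable codewords and aggregates the residuals per codeword."""
    def __init__(self, dim, num_codewords):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(num_codewords, dim))
        self.scale = nn.Parameter(torch.ones(num_codewords))  # smoothing factors

    def forward(self, x):                       # x: (batch, frames, dim)
        r = x.unsqueeze(2) - self.codewords     # residuals: (B, T, K, dim)
        w = F.softmax(-self.scale * r.pow(2).sum(-1), dim=2)  # soft assignments
        e = (w.unsqueeze(-1) * r).sum(1)        # aggregate: (B, K, dim)
        return F.normalize(e, dim=-1).flatten(1)  # (B, K * dim)

class SpatialPyramidEncoding(nn.Module):
    """Partitions the time axis into increasingly fine bins (here 1 and 2),
    encodes each bin with an LDE layer, and concatenates the results."""
    def __init__(self, dim, num_codewords=64, levels=(1, 2)):
        super().__init__()
        self.levels = levels
        self.encoders = nn.ModuleList([LDE(dim, num_codewords) for _ in levels])

    def forward(self, x):                       # x: (batch, frames, dim)
        outs = []
        for enc, n_bins in zip(self.encoders, self.levels):
            for chunk in torch.chunk(x, n_bins, dim=1):  # temporal sub-regions
                outs.append(enc(chunk))
        return torch.cat(outs, dim=1)           # fixed-dimensional embedding
```

Because each LDE output has a fixed size regardless of how many frames it sees, the concatenated representation has a fixed dimension for any input length. The ring loss term used for deep length normalization can likewise be sketched as an auxiliary loss with a single learnable target radius; the loss weight below is an assumed hyperparameter.

```python
class RingLoss(nn.Module):
    """Ring loss: penalizes the deviation of embedding norms from a
    learnable radius R, added to the primary classification loss."""
    def __init__(self, loss_weight=0.01):       # weight is an assumption
        super().__init__()
        self.radius = nn.Parameter(torch.tensor(1.0))
        self.loss_weight = loss_weight

    def forward(self, x):                       # x: (batch, dim) embeddings
        return self.loss_weight * (x.norm(p=2, dim=1) - self.radius).pow(2).mean()
```

Because the penalty is a smooth function of the embedding norm and R is learned jointly with the network weights, the norms are pulled toward a common value gradually during training rather than being hard-normalized afterwards.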
d-vector systems

We can classify d-vector-based SV systems according to the loss function used. The first type is based on the softmax loss, defined in [23] as the combination of a cross-entropy loss, a softmax function, and the last fully connected layer [7, 8, 24]. In such systems, a speaker classifier is trained to discriminate among the speakers in the training set. The softmax loss encourages the separability of speaker embeddings; however, it is not sufficient to learn discriminative embeddings with a large margin, which has led researchers to explore discriminative loss functions for better generalization.
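A minimal sketch of this construction, assuming a PyTorch setup where an embedding network has already produced d-vectors: the last fully connected layer maps embeddings to per-speaker logits, and `nn.CrossEntropyLoss` fuses the softmax and cross-entropy components. The embedding dimension and speaker count are illustrative.

```python
import torch
import torch.nn as nn

embedding_dim, num_speakers = 256, 1211   # illustrative sizes

# Softmax loss = last fully connected layer + softmax + cross-entropy.
classifier = nn.Linear(embedding_dim, num_speakers)
criterion = nn.CrossEntropyLoss()         # fuses softmax and cross-entropy

embeddings = torch.randn(32, embedding_dim)       # a batch of d-vectors
labels = torch.randint(0, num_speakers, (32,))    # speaker identities
loss = criterion(classifier(embeddings), labels)  # the softmax loss
```

At test time the classifier head is discarded and the embeddings themselves are compared, which is why the separability this loss induces, rather than its classification accuracy, is what matters for verification.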