Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

Cai, Weicheng; Chen, Jinkun; Li, Ming

doi:10.21437/odyssey.2018-11

Cited by 284 publications

(266 citation statements)

References 35 publications

Supporting

Mentioning

265

Contrasting


“…An encoding layer is then applied to the top of it to get the utterance level representation. The most common encoding method is the average pooling layer, which aggregates the statistics (i.e., mean, or mean and standard deviation) [1,2].…”

Section: Revisit: Deep Speaker Embedding (mentioning)

confidence: 99%

Within-Sample Variability-Invariant Loss for Robust Speaker Recognition Under Noisy Environments

Cai

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Despite the significant improvements in speaker recognition enabled by deep neural networks, unsatisfactory performance persists under noisy environments. In this paper, we train the speaker embedding network to learn the "clean" embedding of the noisy utterance. Specifically, the network is trained with the original speaker identification loss with an auxiliary within-sample variability-invariant loss. This auxiliary variability-invariant loss is used to learn the same embedding among the clean utterance and its noisy copies and prevents the network from encoding the undesired noises or variabilities into the speaker representation. Furthermore, we investigate the data preparation strategy for generating clean and noisy utterance pairs on-the-fly. The strategy generates different noisy copies for the same clean utterance at each training step, helping the speaker embedding network generalize better under noisy environments. Experiments on VoxCeleb1 indicate that the proposed training framework improves the performance of the speaker verification system in both clean and noisy conditions.
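The training objective described in this abstract can be sketched as an identification loss plus a weighted term that pulls the clean and noisy embeddings together. This is only an illustration under stated assumptions: the abstract does not say which distance the auxiliary loss uses, so mean squared error is assumed here, and the function names, the `weight` parameter, and the choice to apply the identification loss to the noisy copy are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Speaker identification loss for one utterance (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def mean_squared_error(a, b):
    """Squared difference between two embeddings, averaged over dimensions.

    Assumption: the abstract does not name the distance; MSE is one option.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(logits_noisy, label, emb_clean, emb_noisy, weight=1.0):
    """Identification loss plus a weighted clean/noisy matching term.

    `weight` is a hypothetical hyperparameter balancing the two terms.
    """
    id_loss = softmax_cross_entropy(logits_noisy, label)
    aux_loss = mean_squared_error(emb_clean, emb_noisy)
    return id_loss + weight * aux_loss
```

When the clean and noisy embeddings coincide, the auxiliary term vanishes and only the identification loss remains; any mismatch between them raises the total.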


“…The superiority of deep speaker embedding systems has been shown in text-independent speaker recognition for closed talking [21,22] and far-field scenarios [24,25]. In this paper, we adopt the deep speaker embedding system, which is initially designed for the text-independent speaker verification, as baseline.…”

Section: Model Architecture (mentioning)

confidence: 99%

“…The single-channel network structure is the same as in [22]. There are three main components in this framework.…”

Section: Model Architecture (mentioning)

confidence: 99%

HI-MIA: A Far-Field Text-Dependent Speaker Verification Database and the Baselines

Qin

Bu

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


This paper presents a far-field text-dependent speaker verification database named HI-MIA. We aim to meet the data requirement for far-field microphone array based speaker verification since most of the publicly available databases are single channel close-talking and text-independent. The database contains recordings of 340 people in rooms designed for the far-field scenario. Recordings are captured by multiple microphone arrays located in different directions and distance to the speaker and a high-fidelity close-talking microphone. Besides, we propose a set of end-to-end neural network based baseline systems that adopt single-channel data for training. Moreover, we propose a testing background aware enrollment augmentation strategy to further enhance the performance. Results show that the fusion systems could achieve 3.29% EER in the far-field enrollment far field testing task and 4.02% EER in the close-talking enrollment and far-field testing task.


“…One aspect of our study is therefore an attempt to find out how effective these recent developments in speaker verification are for speaker adaption in TTS. More specifically we investigate the capability of neural speaker embeddings [16,17,19] to capture and model characteristics of speakers that were unseen during TTS model training. For this purpose, we extend an improved Tacotron system in [28] to a multi-speaker TTS system and conduct systematic analysis to answer the above question.…”

Section: Introduction (mentioning)

confidence: 99%

Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings

Cooper

Lai

Yasuda

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

148

125


While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers. We investigate multi-speaker modeling for end-to-end text-to-speech synthesis and study the effects of different types of state-of-the-art neural speaker embeddings on speaker similarity for unseen speakers. Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task; these embeddings also improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.

