Interspeech 2017
DOI: 10.21437/interspeech.2017-1050

Joint Training of Expanded End-to-End DNN for Text-Dependent Speaker Verification

Abstract: We propose an expanded end-to-end DNN architecture for speaker verification based on b-vectors as well as d-vectors. We embedded the components of a speaker verification system, such as modeling frame-level features, extracting utterance-level features, dimensionality reduction of utterance-level features, and trial-level scoring, in an expanded end-to-end DNN architecture. The main contribution of this paper is that, instead of using DNNs as parts of the system trained independently, we train the whole system jointly.
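The abstract's pipeline can be sketched roughly as follows (a minimal PyTorch sketch, not the paper's exact architecture: the layer sizes are invented, and the b-vector-style pair combination is simplified here to concatenation of the two utterance embeddings):

    # Hypothetical sketch of the expanded end-to-end pipeline (PyTorch).
    # Sizes and the pair-combination step are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ExpandedEndToEnd(nn.Module):
        def __init__(self, feat_dim=40, hidden=256, emb_dim=128):
            super().__init__()
            # Frame-level modeling (d-vector-style DNN over acoustic frames)
            self.frame_net = nn.Sequential(
                nn.Linear(feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            # Dimensionality reduction of the utterance-level feature
            self.reduce = nn.Linear(hidden, emb_dim)
            # Trial-level scoring on the combined enrol/test pair
            self.score = nn.Sequential(
                nn.Linear(emb_dim * 2, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def embed(self, frames):                 # frames: (T, feat_dim)
            h = self.frame_net(frames)           # frame-level features
            return self.reduce(h.mean(dim=0))    # average -> utterance-level

        def forward(self, enrol_frames, test_frames):
            e, t = self.embed(enrol_frames), self.embed(test_frames)
            return self.score(torch.cat([e, t])) # accept/reject logit

    # Joint training: a single verification loss backpropagates through
    # scoring, reduction, pooling, and frame-level layers together.
    model = ExpandedEndToEnd()
    loss_fn = nn.BCEWithLogitsLoss()
    logit = model(torch.randn(200, 40), torch.randn(180, 40))
    loss = loss_fn(logit, torch.ones(1))
    loss.backward()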

Cited by 24 publications (17 citation statements) · References 14 publications
“…The batch size for training the two models was 40. For efficient training of the spectrogram-based CNN-GRU model, we first trained the CNN part of the model without the GRU layer and then re-trained the whole network after attaching the GRU layer to the CNN, following the multi-step training scheme reported in [9,21]. Table 1 shows the baseline of this study, which is an improved version of the authors' submission to the DCASE 2018 competition.…”
Section: Experimental Settings
confidence: 99%
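A minimal sketch of this multi-step scheme, assuming a PyTorch setup; the convolution sizes, class count, and pooling choices are placeholders rather than the cited system's configuration:

    # Step 1 trains the CNN with a temporary classifier head; step 2
    # attaches the GRU and re-trains the whole network end to end.
    import torch
    import torch.nn as nn

    cnn = nn.Sequential(                       # spectrogram front-end
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, None)),       # collapse frequency axis
    )

    # Step 1: CNN + temporary pooling head, trained to convergence.
    temp_head = nn.Linear(32, 10)              # 10 = placeholder class count
    def step1_logits(x):                       # x: (B, 1, F, T)
        h = cnn(x).squeeze(2)                  # (B, 32, T)
        return temp_head(h.mean(dim=2))        # average over time

    # Step 2: keep the trained CNN, attach a GRU, re-train the whole model.
    gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
    head = nn.Linear(64, 10)
    def step2_logits(x):
        h = cnn(x).squeeze(2).transpose(1, 2)  # (B, T, 32)
        _, h_n = gru(h)                        # final GRU state
        return head(h_n.squeeze(0))

    logits = step2_logits(torch.randn(4, 1, 64, 100))  # -> (4, 10)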
“…Deep networks often exploit pre-training schemes to achieve improved generalization performance. One such scheme was introduced by Heo et al. [10]. This scheme trains the DNN over several stages, each stage using the parameters of the preceding DNN as its initialization.…”
Section: Multi-step Training
confidence: 99%
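The stage-wise initialization can be illustrated with PyTorch's partial state-dict loading; the two toy models below are hypothetical stand-ins for successive training stages:

    import torch.nn as nn

    # Stage 1: a shallow network, trained to convergence (training omitted).
    stage1 = nn.Sequential(nn.Linear(40, 256), nn.ReLU())
    # Stage 2: the same front layers plus a newly added layer.
    stage2 = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 128))
    # Copy parameters whose names match; the new layer keeps fresh weights.
    stage2.load_state_dict(stage1.state_dict(), strict=False)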
“…Studies on raw waveform processing and back-end classification have also been conducted [3,6,9]. Individual DNNs have been integrated to form end-to-end DNNs [1,3,10,11,12,13].…”
Section: Introduction
confidence: 99%
“…Recently, speaker representation models have moved from the commonly used i-vector model [1,2,3] with a probabilistic linear discriminant analysis (PLDA) back-end [4,5] to a new paradigm: speaker embeddings trained from deep neural networks. Various speaker embeddings based on different network architectures [6,7], attention mechanisms [8,9], loss functions [10,11], noise robustness [12,13], and training paradigms [14,15] have been proposed and have greatly improved the performance of speaker verification systems. Snyder et al. [6] recently proposed the x-vector model, which is based on a Time-Delay Neural Network (TDNN) architecture that computes speaker embeddings from variable-length acoustic segments.…”
Section: Introduction
confidence: 99%
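A simplified x-vector-style extractor along the lines described above (assumed PyTorch; the layer widths, context sizes, and feature dimension are illustrative, not Snyder et al.'s exact recipe):

    # TDNN layers = 1-D convolutions over time with growing dilation,
    # followed by statistics pooling to handle variable-length input.
    import torch
    import torch.nn as nn

    class XVector(nn.Module):
        def __init__(self, feat_dim=24, emb_dim=512):
            super().__init__()
            self.tdnn = nn.Sequential(
                nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
                nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
            )
            self.embed = nn.Linear(2 * 1500, emb_dim)

        def forward(self, x):                  # x: (B, feat_dim, T), any T
            h = self.tdnn(x)                   # (B, 1500, T')
            # Statistics pooling maps variable length to a fixed vector
            stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
            return self.embed(stats)           # fixed-size speaker embedding

    emb = XVector()(torch.randn(2, 24, 300))   # -> (2, 512)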