Interspeech 2019
DOI: 10.21437/interspeech.2019-1542
Far-Field End-to-End Text-Dependent Speaker Verification Based on Mixed Training Data with Transfer Learning and Enrollment Data Augmentation

Abstract: In this paper, we focus on the far-field end-to-end text-dependent speaker verification task with a small-scale far-field text-dependent dataset and a large-scale close-talking text-independent database for training. First, we show that simulating far-field text-independent data from the existing large-scale clean database for data augmentation can reduce the mismatch. Second, using a small far-field text-dependent dataset to fine-tune the deep speaker embedding model pre-trained from the simulated far-field as…
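The far-field simulation the abstract describes is typically done by convolving clean close-talking speech with a room impulse response (RIR) and mixing in noise at a target signal-to-noise ratio. The sketch below illustrates that general recipe; the function name, argument layout, and SNR handling are our own assumptions, not details taken from the paper:

```python
import numpy as np

def simulate_far_field(clean, rir, noise, snr_db):
    """Simulate a far-field utterance from close-talking speech.

    Convolves the clean signal with a room impulse response (RIR),
    then adds noise scaled to the requested SNR in dB.
    All inputs are 1-D float arrays at the same sample rate.
    """
    # Reverberate: full convolution, trimmed back to the input length.
    reverbed = np.convolve(clean, rir)[: len(clean)]

    # Tile or trim the noise to match the signal length.
    if len(noise) < len(reverbed):
        reps = int(np.ceil(len(reverbed) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(reverbed)]

    # Scale the noise so that 10*log10(P_signal / P_noise) == snr_db.
    sig_pow = np.mean(reverbed ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverbed + scale * noise
```

In practice, RIRs and noise clips would come from corpora such as the openly available OpenSLR room-impulse-response and noise sets, and each clean utterance would be paired with randomly sampled RIR, noise, and SNR values.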

Cited by 30 publications (19 citation statements)
References 33 publications
“…According to the experiments in [22], the strategy of transfer learning performs well in the far-field text-dependent speaker verification tasks. Therefore, we select the data from SLR38, SLR47 [23], SLR62, SLR82 [24], SLR85 [25] on openslr 5…”
Section: Speaker Verification Network Architecture
confidence: 99%
“…HI-MIA includes two sub-databases: AISHELL-wakeup, with utterances from 254 speakers, and the AISHELL-2019B-eval dataset, with utterances from 86 speakers. The AISHELL-wakeup database has 3,936,003 utterances with 1,561.12 hours in total. The content of the utterances covers two wake-up words: 'ni hao, mi ya' ("你好,米雅") in Chinese and 'Hi, Mia' in English.…”
Section: The HI-MIA Database
confidence: 99%
“…In practice, to achieve state-of-the-art performance, text-dependent SV requires the same set of text to be spoken during the training stage and the test stage [9,10,11]. When there is no, or only limited, training data with the designated phrase to build a text-dependent SV system, those data augmentation approaches cannot yield promising performance.…”
Section: Introduction
confidence: 99%