2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461932

Speaker-Invariant Training Via Adversarial Learning

Abstract: We propose a novel adversarial multi-task learning scheme, aiming at actively curtailing the inter-talker feature variability while maximizing its senone discriminability so as to enhance the performance of a deep neural network (DNN) based ASR system. We call the scheme speaker-invariant training (SIT). In SIT, a DNN acoustic model and a speaker classifier network are jointly optimized to minimize the senone (tied triphone state) classification loss, and simultaneously mini-maximize the speaker classification…
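The mini-max objective described in the abstract is typically realized with a gradient reversal layer (GRL) between the shared feature extractor and the speaker classifier. Below is a minimal sketch of this setup, assuming PyTorch; the layer sizes, module names, and the reversal weight lam are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of speaker-invariant training (SIT) with a gradient
# reversal layer (GRL), assuming PyTorch. Layer sizes, module names, and
# the reversal weight lam are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the shared encoder; no grad for lam.
        return -ctx.lam * grad_output, None

class SITModel(nn.Module):
    def __init__(self, feat_dim, num_senones, num_speakers, lam=0.5):
        super().__init__()
        self.lam = lam
        # Shared feature extractor whose output should become speaker-invariant.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        # Main task: senone (tied triphone state) classification.
        self.senone_head = nn.Linear(512, num_senones)
        # Adversarial task: speaker classification, fed through the GRL.
        self.speaker_head = nn.Linear(512, num_speakers)

    def forward(self, x):
        h = self.encoder(x)
        senone_logits = self.senone_head(h)
        speaker_logits = self.speaker_head(GradReverse.apply(h, self.lam))
        return senone_logits, speaker_logits
```

With this wiring, minimizing the sum of both classification losses lets the speaker head learn to identify speakers while the reversed gradient pushes the encoder to erase speaker information, which is the mini-max behavior the abstract describes.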


Cited by 94 publications (63 citation statements).
References 26 publications.
“…Different from domain adaptation, speaker adaptation only has access to limited adaptation data from target speakers and has no access to the source-domain data. Many techniques have been proposed for speaker adaptation of deep acoustic models, such as regularization-based [25,26,27], transformation-based [28,29], singular value decomposition-based [30,31], subspace-based [32,33] and adversarial learning-based [34,35] approaches. Among these, KL divergence (KLD) regularization [25] is one of the most popular methods to prevent the adapted model from overfitting the limited speaker data.…”
Section: Conditional T/S Learning for Speaker Adaptation (mentioning, confidence: 99%)
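To make the KLD-regularized criterion mentioned in this snippet concrete, here is a minimal sketch, assuming PyTorch; the interpolation weight rho and the function name are illustrative assumptions. It uses the standard formulation in which the adaptation target is the hard label interpolated with the frozen speaker-independent (SI) model's posteriors, which is equivalent (up to a constant) to adding a KL term between the SI and adapted posteriors.

```python
# Minimal sketch of KL-divergence (KLD) regularized speaker adaptation,
# assuming PyTorch. The weight rho and all names are illustrative.
import torch
import torch.nn.functional as F

def kld_adaptation_loss(adapted_logits, si_logits, labels, rho=0.5):
    """Cross-entropy against an interpolated target:
    y_hat = (1 - rho) * one_hot(labels) + rho * softmax(si_logits).
    Minimizing this equals, up to a constant, minimizing
    (1 - rho) * CE + rho * KL(p_SI || p_adapted)."""
    num_classes = adapted_logits.size(-1)
    one_hot = F.one_hot(labels, num_classes).float()
    si_post = F.softmax(si_logits, dim=-1).detach()  # SI model stays frozen
    target = (1.0 - rho) * one_hot + rho * si_post
    log_p = F.log_softmax(adapted_logits, dim=-1)
    return -(target * log_p).sum(dim=-1).mean()
```

Setting rho to 0 recovers plain fine-tuning on the adaptation data, while larger rho keeps the adapted posteriors closer to the SI model, which is how the regularizer limits overfitting to the limited speaker data.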
“…We additionally provide results on the combined WSJ0+CHiME3 dataset. We follow [11] to use far-field speech from the fifth microphone channel for all sets. We adopt the same input-output setting for CHiME3 as WSJ0.…”
Section: Methods (mentioning, confidence: 99%)
“…The frame-level cluster labels are regarded as pseudo phone labels to support supervised DNN training. Motivated by successful applications of adversarial training [20] in a wide range of domain-invariant learning tasks [21][22][23][24], this work proposes to add an auxiliary adversarial speaker classification task to explicitly target speaker-invariant feature learning. After speaker adversarial multi-task learning (AMTL) DNN training, the softmax posteriorgram (PG) representation from the pseudo phone classification task is used to infer subword unit sequences.…”
Section: System Description 2.1 General Framework (mentioning, confidence: 99%)
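To make the AMTL recipe in this snippet concrete, here is a minimal sketch of a single training step, assuming PyTorch and a model whose speaker branch sits behind a gradient reversal layer (as in the SIT sketch near the top); the function name and the weight alpha are illustrative assumptions, not details from the cited work.

```python
# Minimal sketch of one speaker adversarial multi-task learning (AMTL)
# step, assuming PyTorch. The model is assumed to return
# (phone_logits, speaker_logits), with the speaker branch behind a GRL.
import torch
import torch.nn.functional as F

def amtl_step(model, optimizer, feats, pseudo_phone_labels, speaker_labels, alpha=1.0):
    phone_logits, speaker_logits = model(feats)
    # Main task: supervised training on frame-level pseudo phone labels.
    phone_loss = F.cross_entropy(phone_logits, pseudo_phone_labels)
    # Auxiliary adversarial task: speaker classification.
    speaker_loss = F.cross_entropy(speaker_logits, speaker_labels)
    # With a GRL inside the model, simply *adding* the speaker loss makes
    # the shared encoder maximize it (speaker invariance) while the
    # speaker head minimizes it.
    loss = phone_loss + alpha * speaker_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training, F.softmax(phone_logits, dim=-1) would serve as the softmax posteriorgram (PG) representation that the snippet says is used to infer subword unit sequences.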