Speaker adaptive training (SAT) is a well-studied technique for Gaussian mixture acoustic models (GMMs). Recently, we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform conventional DNNs in word error rate (WER). In this paper, we present different methods to further improve and extend SAT-DNN. First, we conduct a detailed analysis of i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling, and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from the video signal. On a collection of instructional videos, incorporating these additional visual features is observed to boost the recognition accuracy of SAT-DNN.
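To make the fusion idea concrete, the following is a minimal sketch (not code from the paper) of i-vector-based speaker-adaptive input fusion: a per-utterance speaker i-vector is broadcast across frames and concatenated with the acoustic features at the DNN input. All dimensions and layer sizes here are hypothetical placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SATDNN(nn.Module):
    """Illustrative SAT-DNN: speaker i-vector concatenated with per-frame
    acoustic features at the network input (assumed dimensions throughout)."""

    def __init__(self, feat_dim=40, ivec_dim=100, hidden_dim=1024, num_states=3000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + ivec_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_states),  # senone scores (pre-softmax)
        )

    def forward(self, frames, ivector):
        # frames: (T, feat_dim); ivector: (ivec_dim,) for the utterance's speaker.
        # Broadcast the speaker representation to every frame, then fuse by concatenation.
        ivec = ivector.unsqueeze(0).expand(frames.size(0), -1)
        return self.net(torch.cat([frames, ivec], dim=1))

# Usage: 200 frames of 40-dim features plus one 100-dim speaker i-vector.
model = SATDNN()
logits = model(torch.randn(200, 40), torch.randn(100))  # -> (200, 3000)
```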