Discriminative Autoencoders for Acoustic Modeling

Yang, Ming; Lee, Hung Shin; Lu, Yu; Chen, Kuan‐Yu; Tsao, Yu; Chen, Berlin; Wang, Hsin Min

doi:10.21437/interspeech.2017-221

Cited by 5 publications

(4 citation statements)

References 23 publications

“…where α and β are weighting factors. In previous studies in [29,30,31], only the CE and MSE were jointly optimized in AM training. In our experiments, we set α to 5 as in most previous studies and heuristically set β to 0.2.…”

Section: Final Objective Function (mentioning)

confidence: 99%

Speech-enhanced and Noise-aware Networks for Robust Speech Recognition

Lee, Chen, Yu et al. 2022

Preprint

Self Cite

Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature enhancement module is composed of a multi-task autoencoder, where noisy speech is decomposed into clean speech and noise. By concatenating its enhanced, noise-aware, and noisy features for each frame, the acoustic-modeling module maps each feature-augmented frame into a triphone state by optimizing the lattice-free maximum mutual information and cross entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task.
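The relative WER reductions quoted in this abstract follow the usual definition: the drop in WER divided by the baseline WER. A minimal sketch of that arithmetic is given below. The function name and the example numbers are hypothetical and chosen only for illustration; the baseline WERs behind the reported 15.20% and 33.53% figures are not stated here.

```python
def relative_wer_reduction(baseline_wer, proposed_wer):
    """Return the relative WER reduction in percent.

    Both arguments are WERs in percent (e.g. 5.0 means 5.0%).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - proposed_wer) / baseline_wer * 100.0


# Hypothetical numbers, not taken from the paper:
# a baseline at 5.0% WER and a proposed system at 4.0% WER.
reduction = relative_wer_reduction(5.0, 4.0)
print(f"relative WER reduction: {reduction:.2f}%")
```

With these made-up inputs the reduction is about 20%, even though the absolute difference is only one percentage point.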

“…To import the advantage of unsupervised learning in [15], [16] into our acoustic modeling, the DNN model for generating the emission probabilities of the output labels can be regarded as the encoder, and the decoder is used to reconstruct the acoustic features, as shown in Figure 2. The output of the penultimate layer of the encoder is supposed to be a pure phoneme-related vector without any information irrelevant to the phonetic content.…”

Section: Proposed Model a Filtering Mechanism For Acoustic Modeling (mentioning)

confidence: 99%

“…In [15], [16], discriminative autoencoder-based (DcAE) acoustic modeling was proposed to separate the acoustic feature into the components of phoneme, speaker and environmental noise. Such model-space innovation takes a great advantage of unsupervised learning to extract pure phonetic components from the acoustic feature to better recognize speech.…”

Section: Introduction (mentioning)

confidence: 99%

Filter-based Discriminative Autoencoders for Children Speech Recognition

Tai, Lee, Yu et al. 2022

Preprint

Self Cite

Children speech recognition is indispensable but challenging due to the diversity of children's speech. In this paper, we propose a filter-based discriminative autoencoder for acoustic modeling. To filter out the influence of various speaker types and pitches, auxiliary information of the speaker and pitch features is input into the encoder together with the acoustic features to generate phonetic embeddings. In the training phase, the decoder uses the auxiliary information and the phonetic embedding extracted by the encoder to reconstruct the input acoustic features. The autoencoder is trained by simultaneously minimizing the ASR loss and feature reconstruction error. The framework can make the phonetic embedding purer, resulting in more accurate senone (triphone-state) scores. Evaluated on the test set of the CMU Kids corpus, our system achieves a 7.8% relative WER reduction compared to the baseline system. In the domain adaptation experiment, our system also outperforms the baseline system on the British-accent PF-STAR task.
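The joint objective described in this abstract (ASR loss plus feature reconstruction error, minimized together) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: frame-level cross entropy stands in for the ASR loss, mean squared error stands in for the reconstruction error, and `recon_weight` is a hypothetical hyperparameter. All function names are invented for this example.

```python
import math


def cross_entropy(posteriors, target_index):
    """Negative log-probability assigned to the correct senone for one frame."""
    return -math.log(posteriors[target_index])


def mean_squared_error(features, reconstruction):
    """Average squared difference between input and reconstructed features."""
    squared = [(f - r) ** 2 for f, r in zip(features, reconstruction)]
    return sum(squared) / len(squared)


def joint_loss(posteriors, target_index, features, reconstruction, recon_weight=1.0):
    """Assumed form: ASR loss + recon_weight * reconstruction error."""
    asr = cross_entropy(posteriors, target_index)
    recon = mean_squared_error(features, reconstruction)
    return asr + recon_weight * recon


# Toy frame: three senone posteriors from the encoder and a
# two-dimensional acoustic feature reconstructed by the decoder.
loss = joint_loss(
    posteriors=[0.7, 0.2, 0.1],
    target_index=0,
    features=[1.0, 2.0],
    reconstruction=[1.0, 2.5],
    recon_weight=0.5,
)
print(loss)
```

Because both terms are computed from the same encoder output, lowering the combined value pushes the phonetic embedding to stay discriminative while still supporting reconstruction of the input.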

“…The other, built up of interleaving TDNN layers and long short-term memory (LSTM) layers [23], is the TDNN-LSTM structure [24], which has a wide temporal context and performs as well as bidirectional LSTM networks with less latency [25]. Following the work of discriminative autoencoders (DcAEs) for acoustic modeling in [26], where the encoder layers were only implemented by deep FFNs with temporarily augmented fMLLR features, we separately use TDNNs and TDNN-LSTM networks as encoders to see if the advantages of DcAEs can be sustained and developed. As described in [27], an autoencoder can help to preserve the most salient information.…”

Section: Introduction (mentioning)

confidence: 99%

Exploring the Encoder Layers of Discriminative Autoencoders for LVCSR

Huang, Lee, Wang et al. 2019

Interspeech 2019

Self Cite

Discriminative autoencoders (DcAEs) have been proven to improve generalization of the learned acoustic models by increasing their reconstruction capacity of input features from the frame embeddings. In this paper, we integrate DcAEs into two models, namely TDNNs and LSTMs, which have been commonly adopted in the Kaldi recipes for LVCSR in recent years, using the modified nnet3 neural network library. We also explore two kinds of skip-connection mechanisms for DcAEs, namely concatenation and addition. The results of LVCSR experiments on the MATBN Mandarin Chinese corpus and the WSJ English corpus show that the proposed DcAE-TDNN-based system achieves relative word error rate reductions of 3% and 10% over the TDNN-based baseline system, respectively. The DcAE-TDNN-LSTM-based system also outperforms the TDNN-LSTM-based baseline system. The results imply the flexibility of DcAEs to be integrated with other existing or prospective neural network-based acoustic models.
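The two skip-connection mechanisms named in this abstract, concatenation and addition, differ in how an earlier layer's output is combined with a later one. The sketch below shows only the two generic operations on plain Python lists. Where the connections attach inside the DcAE is not stated in the abstract, so the vectors and function names here are hypothetical.

```python
def skip_concat(layer_output, skip_input):
    """Concatenation: keep both vectors side by side; the output dimension grows."""
    return list(layer_output) + list(skip_input)


def skip_add(layer_output, skip_input):
    """Addition: element-wise sum; both vectors must share one dimension."""
    if len(layer_output) != len(skip_input):
        raise ValueError("addition skip connection needs equal dimensions")
    return [a + b for a, b in zip(layer_output, skip_input)]


# Hypothetical two-dimensional activations.
later_layer = [0.5, -1.0]
earlier_layer = [1.0, 2.0]

print(skip_concat(later_layer, earlier_layer))  # [0.5, -1.0, 1.0, 2.0]
print(skip_add(later_layer, earlier_layer))     # [1.5, 1.0]
```

Concatenation preserves both inputs at the cost of a wider following layer, while addition keeps the dimension fixed but requires matching sizes.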
