Improving Noise Robustness of Automatic Speech Recognition via Parallel Data and Teacher-student Learning

Mošner, Ladislav; Wu, Minhua; Raju, Anirudh; Parthasarathi, Sree Hari Krishnan; Kumatani, Kenichi; Sundaram, Shiva; Maas, Roland; Hoffmeister, Björn

doi:10.1109/icassp.2019.8683422

Cited by 40 publications

(40 citation statements)

References 26 publications


“…The gains we observe using T/S training are in the same range as the results reported in [15], and our experiments demonstrate that we can distill information from a single-channel teacher model to a multi-channel student model and learn the front-end components in a data-driven manner. The second observation we make is that the WER improvements from pre-training persist even after T/S training.…”

Section: Results With T/S Training (supporting)

confidence: 78%

“…We also soften the senone logits output by the teacher using temperature T. For all our experiments, we use T = 2 since that was found to be the optimal parameter in [15].…”

Section: Teacher-student Training (mentioning)

confidence: 99%
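The temperature softening mentioned in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names, the example logits, and the choice to apply the same temperature to the student output (as is common in knowledge distillation) are assumptions; only T = 2 comes from the quoted text.

```python
import math

def softmax_with_temperature(logits, temperature=2.0):
    """Softmax over logits / temperature; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ts_frame_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student posteriors for one frame.

    Assumption: the student output is softened with the same temperature.
    """
    targets = softmax_with_temperature(teacher_logits, temperature)
    preds = softmax_with_temperature(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, preds))

# Hypothetical senone logits for a single frame (illustrative values only).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]

sharp = softmax_with_temperature(teacher, temperature=1.0)
soft = softmax_with_temperature(teacher, temperature=2.0)
loss = ts_frame_loss(teacher, student)
```

With T = 2 the teacher's top posterior is lower than with T = 1, so the student also receives information about the relative likelihood of the competing senones.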

“…In Li et al. [14], the authors improve speech recognition performance of a distant microphone by applying T/S training to utterances recorded simultaneously using close-talking and distant microphones. In a similar vein, Mosner et al. [15] apply T/S to improve noise robustness by creating a parallel corpus by adding multimedia interference to clean utterances. T/S strategy has also been used for improving the overall ASR performance of the student model by leveraging significantly larger amounts of untranscribed or unlabelled speech data.…”

Section: Introduction (mentioning)

confidence: 99%
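One way to picture the parallel corpus described in the statement above is to mix an interference signal into a clean utterance at a chosen signal-to-noise ratio, so that the teacher can be fed the clean version and the student the noisy one. The sketch below is an assumption-laden illustration: the function name `mix_at_snr`, the synthetic signals, and the 10 dB target are hypothetical, and the quoted text does not specify how the interference was actually scaled or mixed.

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so the clean-to-noise power ratio equals snr_db."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    if clean_power == 0 or noise_power == 0:
        raise ValueError("signals must have non-zero power")
    # Noise power required to reach the requested SNR.
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [x + gain * n for x, n in zip(clean, noise)]

# Synthetic stand-ins for a clean utterance and an interference recording.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

noisy = mix_at_snr(clean, noise, snr_db=10.0)
parallel_pair = (clean, noisy)  # teacher input, student input
```

Because both signals in the pair are time-aligned, the teacher's frame-level outputs on the clean signal can serve directly as targets for the student on the noisy signal.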


Fully Learnable Front-End for Multi-Channel Acoustic Modeling Using Semi-Supervised Learning

Wager, Khare, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model used in the speech recognition system. For the student, both multi-channel feature extraction layers and the higher classification layers were jointly trained using the logits from the teacher model. In our experiments, compared to a baseline model trained on about 600 hours of transcribed data, a relative word-error rate (WER) reduction of about 27.3% was achieved when using an additional 1800 hours of untranscribed data. We also investigated the benefit of pre-training the multi-channel front end to output the beamformed logmel filter bank energies (LFBE) using L2 loss. We find that pre-training improves the word error rate by 10.7% when compared to a multi-channel model directly initialized with a beamformer and mel-filter bank coefficients for the front end. Finally, combining pre-training and teacher-student training produces a WER reduction of 31% compared to our baseline.


“…Teacher-student (T/S) learning [1,2] has been widely applied to a variety of deep learning tasks in speech, language and image processing including model compression [1,2], domain adaptation [3,4,5], small-footprint neural machine translation (NMT) [6], low-resource NMT [7], far-field automatic speech recognition (ASR) [8,9], low-resource language ASR [10] and neural network pre-training [11]. T/S learning falls in the category of transfer learning, where the network of interest, as a student, is trained by mimicking the behavior of a well-trained network, as a teacher, in the presence of the same or stereo training samples.…”

Section: Introduction (mentioning)

confidence: 99%

Conditional Teacher-student Learning

Meng, Zhao, et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


The teacher-student (T/S) learning has been shown to be effective for a variety of problems such as domain adaptation and model compression. One shortcoming of the T/S learning is that a teacher model, not always perfect, sporadically produces wrong guidance in form of posterior probabilities that misleads the student model towards a suboptimal performance. To overcome this problem, we propose a conditional T/S learning scheme, in which a "smart" student model selectively chooses to learn from either the teacher model or the ground truth labels conditioned on whether the teacher can correctly predict the ground truth. Unlike a naive linear combination of the two knowledge sources, the conditional learning is exclusively engaged with the teacher model when the teacher model's prediction is correct, and otherwise backs off to the ground truth. Thus, the student model is able to learn effectively from the teacher and even potentially surpass the teacher. We examine the proposed learning scheme on two tasks: domain adaptation on CHiME-3 dataset and speaker adaptation on Microsoft short message dictation dataset. The proposed method achieves 9.8% and 12.8% relative word error rate reductions, respectively, over T/S learning for environment adaptation and speaker-independent model for speaker adaptation.
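The selection rule described in this abstract can be sketched per frame as follows. It is a minimal illustration under stated assumptions: the function name `conditional_target` and the example posteriors are hypothetical, and "the teacher can correctly predict the ground truth" is read here as the teacher's top-scoring class matching the label.

```python
def conditional_target(teacher_posteriors, label_index):
    """Pick the training target for one frame.

    Use the teacher's soft posteriors when its top class matches the label,
    otherwise back off to the one-hot ground truth.
    """
    teacher_choice = max(range(len(teacher_posteriors)),
                         key=teacher_posteriors.__getitem__)
    if teacher_choice == label_index:
        return list(teacher_posteriors)
    one_hot = [0.0] * len(teacher_posteriors)
    one_hot[label_index] = 1.0
    return one_hot

# Hypothetical teacher posteriors over three classes.
posteriors = [0.7, 0.2, 0.1]
target_when_right = conditional_target(posteriors, label_index=0)
target_when_wrong = conditional_target(posteriors, label_index=1)
```

Unlike a fixed linear interpolation of the two sources, each frame here uses exactly one of them, which matches the "exclusively engaged" wording in the abstract.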


“…The AM is based on the standard HMM/deep learning hybrid, and we summarize details relevant to this paper in Section II-B. Other aspects of this system have been described elsewhere ([25], [26], [27], [28]). The LM [29] estimates the a priori probability that the speaker will utter a sequence of words.…”

Section: Introduction (mentioning)

confidence: 99%

Realizing Petabyte Scale Acoustic Modeling

Parthasarathi, Sivakrishnan, Ladkat, et al. 2019

IEEE J. Emerg. Sel. Topics Circuits Syst.


Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from the vast firehose of untranscribed audio data. Learning an AM from 1 Million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly scalable and resource efficient SSL system for AM. Employing the student/teacher learning paradigm, we focus on the student learning subsystem: a scalable and robust data pipeline that generates features and targets from raw audio, and an efficient model pipeline, including the distributed trainer, that builds a student model. Our evaluations show that, even without extensive hyper-parameter tuning, we obtain relative accuracy improvements in the 10 to 20% range, with higher gains in noisier conditions. The end-to-end processing time of this SSL system was 12 days, and several components in this system can trivially scale linearly with more compute resources.
