Investigation of transfer learning for ASR using LF-MMI trained neural networks

Ghahremani, Pegah; Manohar, Vimal; Hadian, Hossein; Povey, Daniel; Khudanpur, Sanjeev

doi:10.1109/asru.2017.8268947

Cited by 61 publications

(45 citation statements)

References 21 publications


“…This is particularly important, since different domains might have different amounts of data. In previous work [31], Ghahremani et al recommend that gradients of utterances from a particular domain should be scaled by the inverse of the square root of the number of utterances in the domain, thus effectively over-sampling domains with less data. In Tab.…”

Section: Multidomain Training: Impact Of Data Diversity (mentioning)

confidence: 99%

Recognizing Long-Form Speech Using Streaming End-to-End Models

Narayanan

Prabhavalkar

Chiu

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

109


All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances. On a synthesized long-form test set, adding data diversity improves word error rate (WER) by 90% relative, while simulating long-form training improves it by 67% relative, though the combination doesn't improve over data diversity alone. On a real long-form call-center test set, adding data diversity improves WER by 40% relative. Simulating long-form training on top of data diversity improves performance by an additional 27% relative.
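The citation statement for this entry describes scaling each utterance's gradient by the inverse square root of the number of utterances in its domain. A minimal sketch of that arithmetic follows, assuming a simple dictionary of per-domain utterance counts; the function names and the example counts are hypothetical and are not taken from either paper. It shows only how the scale factors shift the share of total gradient weight toward smaller domains, not how either system implements training.

```python
import math


def domain_gradient_scales(utterance_counts):
    """Per-utterance gradient scale 1/sqrt(N_d) for each domain d."""
    for domain, count in utterance_counts.items():
        if count <= 0:
            raise ValueError(f"domain {domain!r} must have a positive count")
    return {d: 1.0 / math.sqrt(n) for d, n in utterance_counts.items()}


def effective_domain_shares(utterance_counts):
    """Fraction of the total gradient weight each domain contributes.

    Each domain contributes N_d utterances, each weighted by 1/sqrt(N_d),
    so its total weight is sqrt(N_d) before normalization.
    """
    scales = domain_gradient_scales(utterance_counts)
    weight = {d: utterance_counts[d] * scales[d] for d in utterance_counts}
    total = sum(weight.values())
    return {d: w / total for d, w in weight.items()}


# Hypothetical counts, chosen only for illustration.
counts = {"large": 1_000_000, "small": 10_000}
scales = domain_gradient_scales(counts)
shares = effective_domain_shares(counts)
unscaled_small_share = counts["small"] / sum(counts.values())

print(scales)
print(shares)
print(unscaled_small_share)
```

With these assumed counts, the smaller domain's share of the total gradient weight rises from about 1% (unscaled) to about 9%, which is the over-sampling effect the statement refers to.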


“…However, because speech signals are high-dimensional and highly variable even for a single speaker, training deep models and learning these hierarchical representations without a large amount of training data is difficult. The computer vision [15,16], natural language processing [17][18][19][20][21], and ASR [22][23][24][25] communities have attacked the problem of limited supervised training data with great success by pre-training deep models on related tasks for which there is more training data. Following their lead, we propose an efficient ASR-based pre-training methodology in this paper and show that it may be used to improve the performance of end-to-end SLU models, especially when the amount of training data is very small.…”

Section: Introduction (mentioning)

confidence: 99%

Speech Model Pre-Training for End-to-End Spoken Language Understanding

Lugosch

Ravanelli

Ignoto

et al. 2019

Interspeech 2019

193

268


Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the model's ability to generalize to new phrases not heard during training.


“…They also show that pooling data from multiple low-resource domains work better than transfer learning. Unlike [7], the current work studies domain robustness in a much larger scale, where data sparsity is not necessarily a challenge. We also study other forms of mismatch like codec, and consider many more applications domains.…”

Section: Prior Work (mentioning)

confidence: 99%

Toward Domain-Invariant Speech Recognition via Large Scale Training

Narayanan

Misra

Sim

et al. 2018

2018 IEEE Spoken Language Technology Workshop (SLT)


Current state-of-the-art automatic speech recognition systems are trained to work in specific 'domains', defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. This work explores the idea of building a single domain-invariant model for varied use-cases by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: A single model can be robust to multiple application domains, and variations like codecs and noise. More importantly, such models generalize better to unseen conditions and allow for rapid adaptation: we show that by using as little as 10 hours of data from a new domain, an adapted domain-invariant model can match performance of a domain-specific model trained from scratch using 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work.
