Toward Domain-Invariant Speech Recognition via Large Scale Training

Narayanan, Arun; Misra, Ananya; Sim, Khe Chai; Pundak, Golan; Tripathi, Anshuman; Elfeky, Mohamed; Haghani, Parisa; Strohman, Trevor; Bacchiani, Michiel

doi:10.1109/slt.2018.8639610

Cited by 94 publications

(67 citation statements)

References 30 publications


“…For training, we use the same multidomain datasets as in [20,21] which include anonymized and hand-transcribed English utterances from general Google traffic, far-field environments, telephony conversations, and YouTube. We augment the clean training utterances by artificially corrupting them by using a room simulator, varying degrees of noise, and reverberation such that the signal-to-noise ratio (SNR) is between 0dB and 30dB [23].…”

Section: Datasets (mentioning)

confidence: 99%
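The noise-corruption step described in the statement above can be illustrated with a minimal sketch. It covers only additive noise at a target SNR drawn from the stated 0 dB to 30 dB range; the room simulator and reverberation mentioned in the quote are not modeled. The function names, the synthetic sine-wave "clean" signal, and the Gaussian noise are assumptions made for illustration and do not come from the cited papers.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, target_snr_db):
    """Add scaled noise to clean so the mixture has the requested SNR (dB).

    Returns the noisy signal and the gain applied to the noise.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean, p_noise = power(clean), power(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("signals must have non-zero power")
    # SNR_dB = 10 * log10(p_clean / (gain^2 * p_noise)); solve for gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (target_snr_db / 10)))
    noisy = [c + gain * n for c, n in zip(clean, noise)]
    return noisy, gain


def measured_snr_db(clean, noise, gain):
    """SNR in dB of clean against the noise after scaling by gain."""
    return 10 * math.log10(power(clean) / (gain ** 2 * power(noise)))


# Hypothetical inputs: a 440 Hz tone at 16 kHz and white Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# Draw the target SNR uniformly from the range given in the statement.
target = rng.uniform(0.0, 30.0)
noisy, gain = mix_at_snr(clean, noise, target)
```

Sampling the target SNR per utterance, as in the last lines, is one simple way to obtain "varying degrees of noise"; the actual sampling scheme used by the authors is not stated here.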

“…Our experiments are conducted using the same training data as in [20,21], which is from multiple domains such as Voice Search, YouTube, Farfield and Telephony. We first analyze the behavior of the deliberation model, including performance when attending to multiple RNN-T hypotheses, contribution of different attention, and rescoring vs. beam search.…”

Section: Introduction (mentioning)

confidence: 99%


Deliberation Model Based Two-Pass End-To-End Speech Recognition

Sainath, Pang, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the nonstreaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 12% relative WER reduction compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction on a proper noun test set. Compared to a large conventional model, our best model performs 21% relatively better for VS. In terms of computational complexity, the deliberation decoder has a larger size than the LAS decoder, and hence requires more computations in second-pass decoding.
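The relative WER reductions quoted in this abstract follow the usual definition, (baseline WER - new WER) / baseline WER. The sketch below shows how word error rate and a relative reduction can be computed; the example sentences and the function names are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion over four words.
example_wer = word_error_rate("play some jazz music", "play sum jazz")
```

Under this definition, a system that moves from a WER of 6.0% to 5.28% has achieved a 12% relative reduction; those two figures are illustrative only.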


“…Another important aspect in building high-performance speech recognition systems is the amount and the coverage of the training data. To build high performance speech recognition systems for conversational speech, we need to use a large amount of speech data covering various domains [17]. In [18], it has been shown that we need a very large training set (∼125,000 hours of semi-supervised speech data) to achieve high speech recognition accuracy for difficult tasks like video captioning.…”

Section: Introduction (mentioning)

confidence: 99%

End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System

Kim, Shin, Singh, et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)


In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Data reading, large-scale data augmentation, and neural network parameter updates are all performed "on-the-fly". We use vocal tract length perturbation [1] and an acoustic simulator [2] for data augmentation. The processed features and labels are sent to the GPU cluster. The Horovod allreduce approach is employed to train neural network parameters. We evaluated the effectiveness of our system on the standard LibriSpeech corpus [3] and the 10,000-hr anonymized Bixby English dataset. Our end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on test-clean of the LibriSpeech test set after applying shallow fusion with a Transformer language model (LM). For the proprietary English Bixby open domain test set, we obtained a WER of 7.92% using a Bidirectional Full Attention (BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the monotonic chunkwise attention (MoChA) based approach is employed for streaming speech recognition, we obtained a WER of 9.95% on the same Bixby open domain test set.
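Shallow fusion, as mentioned in this abstract, combines the recognizer's log-probability with a weighted language-model log-probability. The sketch below applies that combination to whole hypotheses in an n-best list, which is a simplification: shallow fusion is commonly applied at each step of beam search. The hypotheses, scores, LM weight, and function names are all hypothetical and are not values from the paper.

```python
def shallow_fusion_score(asr_logp, lm_logp, lm_weight):
    """Combined score: log P_asr(y | x) + lm_weight * log P_lm(y)."""
    return asr_logp + lm_weight * lm_logp


def best_hypothesis(nbest, lm_weight):
    """Return the text of the hypothesis with the highest fused score.

    nbest is a list of (text, asr_logp, lm_logp) tuples.
    """
    best = max(nbest, key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight))
    return best[0]


# Hypothetical n-best list: (text, ASR log-prob, LM log-prob).
nbest = [
    ("recognize speech", -4.0, -6.0),
    ("wreck a nice beach", -3.8, -12.0),
]

without_lm = best_hypothesis(nbest, lm_weight=0.0)  # ASR score alone
with_lm = best_hypothesis(nbest, lm_weight=0.3)     # LM shifts the ranking
```

With the LM weight set to zero the acoustically preferred hypothesis wins; with a weight of 0.3 the language model's preference for the more plausible word sequence changes the ranking. The weight is a tunable hyperparameter.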


“…In the era of deep neural networks, it has been frequently observed that the amount and coverage of the training data seem to be one of the most important factors to obtain better speech recognition accuracy [12,13]. However, it is very difficult to gather sufficient amount of transcribed data from various domains.…”

Section: Introduction (mentioning)

confidence: 99%

Power-Law Nonlinearity with Maximally Uniform Distribution Criterion for Improved Neural Network Training in Automatic Speech Recognition

Kim, Kumar, Kim, et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)


In this paper, we describe the Maximum Uniformity of Distribution (MUD) algorithm with the power-law nonlinearity. In this approach, we hypothesize that neural network training will become more stable if the feature distribution is not too skewed. We propose two different types of MUD approaches: power function-based MUD and histogram-based…

Acknowledgment: Thanks to Samsung Electronics for funding this research. The authors are thankful to Executive Vice President Seunghwan Cho and the Speech Processing Lab members at Samsung Research.
