Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus

Kim, Kwangyoun; Jung, Seokyeong; Lee, Jungin; Han, Myoungji; Kim, Chanwoo; Lee, Kyung-Min; Gowda, Dhananjaya; Park, Jun-Mo; Kim, Sung‐Soo; Jin, Sichen; Lee, Young-Yoon; Yeo, Jinsu; Kim, Dae Hyun

doi:10.1109/asru46091.2019.9004027

Cited by 60 publications

(56 citation statements)

References 18 publications

“…As shown in Fig. 2, we apply the "global" mean and variance normalization as in [17], since the utterance-by-utterance mean and variance normalizations are not easily realizable for streaming speech recognition [18]. Note that mean subtraction must be applied before masking, otherwise, the non-zero values in the masked region will distort the model during the training.…”

Section: Small Energy Masking Algorithm (mentioning)

confidence: 99%

Small Energy Masking for Improved Neural Network Training for End-To-End Speech Recognition

Kim

Indurthi

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

In this paper, we present a Small Energy Masking (SEM) algorithm, which masks inputs having values below a certain threshold. More specifically, a time-frequency bin is masked if the filterbank energy in this bin is less than a certain energy threshold. A uniform distribution is employed to randomly generate the ratio of this energy threshold to the peak filterbank energy of each utterance in decibels. The unmasked feature elements are scaled so that the total sum of the feature values remains the same through this masking procedure. This very simple algorithm yields relative Word Error Rate (WER) improvements of 11.2 % and 13.5 % on the standard LibriSpeech test-clean and test-other sets over the baseline end-to-end speech recognition system. Additionally, compared to the input dropout algorithm, the SEM algorithm yields relative improvements of 7.7 % and 11.6 % on the same LibriSpeech test-clean and test-other sets. With a modified shallow-fusion technique with a Transformer LM, we obtained a 2.62 % WER on the LibriSpeech test-clean set and a 7.87 % WER on the LibriSpeech test-other set.
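The masking step described in this abstract can be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function name `small_energy_mask`, the parameters `min_drop_db` and `max_drop_db`, and the choice of power-domain energies (hence the division by 10 when converting decibels) are assumptions, since the abstract does not state the range of the uniform distribution.

```python
import numpy as np


def small_energy_mask(energy, rng, min_drop_db=20.0, max_drop_db=80.0):
    """Mask low-energy time-frequency bins of one utterance.

    energy: array of shape (frames, bins) with non-negative filterbank energies.
    rng:    a numpy.random.Generator.
    The dB range below the peak is an assumed, hypothetical setting.
    """
    # Draw the threshold-to-peak ratio (in dB) from a uniform distribution.
    ratio_db = rng.uniform(-max_drop_db, -min_drop_db)
    # Convert the dB ratio to a linear threshold relative to the utterance peak.
    threshold = energy.max() * 10.0 ** (ratio_db / 10.0)

    # Keep bins at or above the threshold, zero out the rest.
    keep = energy >= threshold
    masked = np.where(keep, energy, 0.0)

    # Rescale the surviving bins so the total sum is unchanged by masking.
    kept_sum = masked.sum()
    if kept_sum > 0.0:
        masked = masked * (energy.sum() / kept_sum)
    return masked, keep
```

Because the threshold ratio is always negative in decibels, the peak bin is never masked, so the rescaling factor is well defined for any utterance with non-zero energy.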

“…We have tried various types of training strategies for better performance [53,54]. Our MoCha implementation and optimization are described in detail in another paper of ours [50]. The structure of our entire end-to-end speech recognition system is shown in Fig.…”

Section: Structure of the End-to-End Speech Recognition System (mentioning)

confidence: 99%

End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System

Kim

Shin

Singh

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Self Cite

In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Data reading, large-scale data augmentation, and neural network parameter updates are all performed "on-the-fly". We use vocal tract length perturbation [1] and an acoustic simulator [2] for data augmentation. The processed features and labels are sent to the GPU cluster. The Horovod allreduce approach is employed to train neural network parameters. We evaluated the effectiveness of our system on the standard LibriSpeech corpus [3] and the 10,000-hr anonymized Bixby English dataset. Our end-to-end speech recognition system built using this training infrastructure showed a 2.44 % WER on the test-clean set of LibriSpeech after applying shallow fusion with a Transformer language model (LM). For the proprietary English Bixby open domain test set, we obtained a WER of 7.92 % using a Bidirectional Full Attention (BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the monotonic chunkwise attention (MoCha) based approach is employed for streaming speech recognition, we obtained a WER of 9.95 % on the same Bixby open domain test set.
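Shallow fusion, as mentioned in this abstract, is commonly understood as adding a weighted language-model log-probability to the ASR log-probability at each decoding step. The sketch below shows that standard form only; the function names and the `lm_weight` value are illustrative assumptions, and nothing here reflects the specific configuration used by the authors.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def shallow_fusion_step(asr_logits, lm_logits, lm_weight=0.3):
    """Combine per-token scores for one decoding step.

    score(token) = log p_asr(token) + lm_weight * log p_lm(token)
    lm_weight is a hypothetical tuning value, normally chosen on held-out data.
    """
    return log_softmax(asr_logits) + lm_weight * log_softmax(lm_logits)


# Usage: the ASR model slightly prefers token 0, the LM strongly prefers token 1.
asr_logits = np.array([2.0, 1.9, 0.0])
lm_logits = np.array([0.0, 3.0, 0.0])
fused = shallow_fusion_step(asr_logits, lm_logits, lm_weight=0.5)
best_token = int(fused.argmax())
```

In a beam search, the fused scores would replace the plain ASR scores when ranking hypotheses; the two models are trained separately and only combined at inference time.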

“…We use an end-to-end attention based ASR model [11,12] with an architecture similar to the one proposed in [10] as depicted in Fig. 1.…”

Section: ASR Model (mentioning)

confidence: 99%

“…t.cohort_models for utterance in validation_sets: correlation(wer_utt, wer_avg_all) # filter utterances < min_correl and min_length for utterance in validation_sets: if correl[utterance] > correl_min: new_set.append(sample) Listing 1: Heuristic to find "condensed" datasets in Fig.…”

Section: Small Dataset Creation (mentioning)

confidence: 99%

ShrinkML: End-to-End ASR Model Compression Using Reinforcement Learning

et al. 2019

End-to-end automatic speech recognition (ASR) models are increasingly large and complex to achieve the best possible accuracy. In this paper, we build an AutoML system that uses reinforcement learning (RL) to optimize the per-layer compression ratios when applied to a state-of-the-art attention based end-to-end ASR model composed of several LSTM layers. We use singular value decomposition (SVD) low-rank matrix factorization as the compression method. For our RL-based AutoML system, we focus on practical considerations such as the choice of the reward/punishment functions, the formation of an effective search space, and the creation of a representative but small data set for quick evaluation between search steps. Finally, we present accuracy results on LibriSpeech of the model compressed by our AutoML system, and we compare it to manually-compressed models. Our results show that in the absence of retraining our RL-based search is an effective and practical method to compress a production-grade ASR system. When retraining is possible, we show that our AutoML system can select better highly-compressed seed models compared to manually hand-crafted rank selection, thus allowing for more compression than previously possible.
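The SVD low-rank factorization named in this abstract can be illustrated with a minimal NumPy sketch. It shows only the generic technique of replacing one weight matrix with two thinner ones; the function name `low_rank_factorize` and the example shapes are assumptions, and the per-layer rank selection by reinforcement learning described in the paper is not reproduced here.

```python
import numpy as np


def low_rank_factorize(weight, rank):
    """Approximate weight (m x n) as a @ b with a: (m x rank), b: (rank x n)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Keep the top-`rank` singular directions; fold the singular values into a.
    a = u[:, :rank] * s[:rank]
    b = vt[:rank, :]
    return a, b


# Usage: a 64 x 32 matrix that is exactly rank 4 is recovered without loss,
# while storing 64*4 + 4*32 = 384 values instead of 64*32 = 2048.
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))
a, b = low_rank_factorize(weight, rank=4)
params_before = weight.size
params_after = a.size + b.size
```

For a real layer the matrix is not exactly low rank, so a smaller rank trades accuracy for size; that per-layer trade-off is what a rank-selection search has to balance.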
