A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency

Sainath, Tara N.; He, Yanzhang; Li, Bo; Narayanan, Arun; Pang, Ruoming; Bruguier, Antoine; Chang, Shuo-Yiin; Li, Wei; Álvarez, Raziel; Chen, Zhifeng; Chiu, Chung‐Cheng; García, David Espada; Gruenstein, Alex; Hu, Ke; Kannan, Anjuli; Liang, Qiao; McGraw, Ian; Peyser, Cal; Prabhavalkar, Rohit; Pundak, Golan; Rybach, David; Shangguan, Yuan; Sheth, Yash; Strohman, Trevor; Visontai, Mirkó; Wu, Yonghui; Zhang, Yu; Zhao, Ding

doi:10.1109/icassp40776.2020.9054188

Cited by 179 publications

(133 citation statements)

References 22 publications

Supporting

Mentioning

127

Contrasting


“…This has the effect of 'freeing up' space on the beam, while retaining the alternative paths in the final lattice where they can be used for downstream applications. Note that this can have a large impact since end-to-end models are typically decoded with small number of candidates in the beam for efficiency [12], and thus the beam diversity tends to reduce for longer utterances [30]. We note that a similar mechanism has been proposed previously by Zapotoczny et al [21] in the context of lattice generation for attention-based encoder-decoder models, and by Liu et al [20] in the context of efficiently rescoring lattices with neural LMs.…”

Section: Decoding With Path Merging To Create Lattices (mentioning)

confidence: 99%

“…These models produce hypotheses in an autoregressive fashion by conditioning the output on all previously predicted labels, thus making fewer conditional independence assumptions than conventional hybrid systems. End-to-end ASR models have been shown to achieve state-of-the-art results [9,10] on popular public benchmarks, as well as on large scale industrial datasets [11,12].…”

Section: Introduction (mentioning)

confidence: 99%

“…In experimental evaluations, we find that the models require 5-gram contexts (i.e., conditioning on the four previous labels) in order to obtain comparable WER results as the baseline. If lattices obtained from the first-pass system are rescored in the second-pass [12,27] with a listen-attend-and-spell (LAS) system [4], the first-pass RNN-T model can be decoded with only a bigram context (i.e. one previous label) to achieve the same WER as the baseline.…”

Section: Introduction (mentioning)

confidence: 99%


Less is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging

Prabhavalkar, Rybach, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct model states, such models are decoded using an approximate beam-search which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model's accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam, and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or up to 15.7% when lattice rescoring is applied.
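The path-merging idea in this abstract can be illustrated with a minimal sketch. This is not the authors' decoder: the function name `merge_paths`, the `(labels, log_prob)` representation, and the rule of keeping only the single best hypothesis per context are assumptions made for illustration. A real RNN-T decoder would also track time alignment and store the merged paths as lattice arcs.

```python
from collections import defaultdict


def merge_paths(beam, context_size):
    """Merge beam hypotheses that share the same last `context_size` labels.

    beam: list of (labels, log_prob) pairs, where labels is a tuple of label ids.
    Returns (survivors, merged). The best-scoring hypothesis for each context
    stays on the beam; the rest are returned separately, standing in for the
    paths that would be kept in the lattice instead of the active beam.
    """
    groups = defaultdict(list)
    for labels, log_prob in beam:
        # A limited-context model only distinguishes the most recent labels.
        context = labels[-context_size:] if context_size > 0 else ()
        groups[context].append((labels, log_prob))

    survivors, merged = [], []
    for hyps in groups.values():
        hyps.sort(key=lambda h: h[1], reverse=True)
        survivors.append(hyps[0])   # stays on the active beam
        merged.extend(hyps[1:])     # removed from the beam, kept for the lattice
    survivors.sort(key=lambda h: h[1], reverse=True)
    return survivors, merged


# Hypothetical beam: two hypotheses end in the same bigram context (2, 3).
beam = [((1, 2, 3), -1.0), ((4, 2, 3), -1.5), ((1, 2, 5), -2.0)]
survivors, merged = merge_paths(beam, context_size=2)
```

With a context of two labels, the first two hypotheses collapse into one beam entry, which frees a beam slot while the lower-scoring path remains available for later rescoring.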



“…The streaming state is preserved by transferring hidden recurrent network states from one decoding instance to the next. In [36], an end-to-end streaming model with RNN Transducers [13] is used to jointly model linguistic and acoustic features by using the previous labels along with the audio features. For training wordpieces [34,39] is used, where words are further segmented into sub-word units.…”

Section: Background and Related Work (mentioning)

confidence: 99%

A Low footprint Automatic Speech Recognition System For Resource Constrained Edge Devices

Dey, Dutta 2020

Proceedings of the 2nd International Workshop on Challenges in Artificial Intelligence and Machine Learning for Internet of Thi


Deep Learning (DL) has been instrumental in pushing artificial intelligence (AI)/machine learning (ML) algorithms to the edge of the network. It allows building AI/ML algorithms for computer vision, speech processing, and other time-series analytics tasks with limited domain knowledge. As there is no mechanism to control the representations learned from a large dataset, it becomes hard to predict whether a very small DL model can learn the proper dependencies needed for a particular problem at hand. With speech recognition capability becoming important in several Internet of Things (IoT) devices, we propose an explainable AI-based methodology to build small DL models for speech recognition by controlling the representations learned by a model under a hard size constraint. We enhance the architecture of a state-of-the-art sequence transduction model to allow the tuning of the accuracy vs. model size tradeoff. Using these techniques we achieve a reduction in model size and latency by a factor of 10 and 6 respectively, with only 4% loss compared to the embedded implementation of a well-known ASR.

CCS CONCEPTS • Human-centered computing → Ubiquitous and mobile computing design and evaluation methods; • Computing methodologies → Speech recognition; Neural networks; • Computer systems organization → Embedded software.


“…The increasing omnipresence of smartphones, smart speakers, and tablets coupled with the adoption of voice assistants has motivated a modern trend to develop Automatic Speech Recognition (ASR) systems which fully operate on local devices [1,2,3]. The promise of on-device ASR includes increased reliability, improved latency and privacy benefits by alleviating the need to stream audio to servers.…”

Section: Introduction (mentioning)

confidence: 99%

Bifocal Neural ASR: Exploiting Keyword Spotting for Inference Optimization

Macoskey, Strimel, Rastrow 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


We present Bifocal RNN-T, a new variant of the Recurrent Neural Network Transducer (RNN-T) architecture designed for improved inference time latency on speech recognition tasks. The architecture enables a dynamic pivot for its runtime compute pathway, namely taking advantage of keyword spotting to select which component of the network to execute for a given audio frame. To accomplish this, we leverage a recurrent cell we call the Bifocal LSTM (BF-LSTM), which we detail in the paper. The architecture is compatible with other optimization strategies such as quantization, sparsification, and applying time-reduction layers, making it especially applicable for deployed, real-time speech recognition settings. We present the architecture and report comparative experimental results on voice-assistant speech recognition tasks. Specifically, we show our proposed Bifocal RNN-T can improve inference cost by 29.1% with matching word error rates and only a minor increase in memory size.

