ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054188

A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency

Abstract: Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of u…
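The two-pass architecture sketched in the abstract (a streaming RNN-T first pass followed by a LAS rescorer) can be illustrated with a short, hedged sketch. The callables `rnnt_beam_search` and `las_score` and the interpolation weight `lam` are placeholders assumed for illustration, not the paper's actual interfaces.

```python
# Minimal sketch of two-pass decoding, assuming a streaming RNN-T first pass
# and a LAS rescorer as described in the abstract. The callables and the
# interpolation weight `lam` are illustrative assumptions.

def two_pass_decode(audio_frames, rnnt_beam_search, las_score, beam_size=8, lam=0.5):
    # First pass: streaming RNN-T beam search yields an n-best list of
    # (token_sequence, first_pass_log_prob) pairs.
    nbest = rnnt_beam_search(audio_frames, beam_size=beam_size)

    # Second pass: once the utterance is finalized, rescore each hypothesis
    # with the LAS decoder and combine the two scores.
    best_tokens, best_score = None, float("-inf")
    for tokens, rnnt_logp in nbest:
        combined = lam * rnnt_logp + (1.0 - lam) * las_score(audio_frames, tokens)
        if combined > best_score:
            best_tokens, best_score = tokens, combined
    return best_tokens
```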

Cited by 179 publications (133 citation statements)
References 22 publications

Citation statements (ordered by relevance):
“…This has the effect of 'freeing up' space on the beam, while retaining the alternative paths in the final lattice, where they can be used for downstream applications. Note that this can have a large impact, since end-to-end models are typically decoded with a small number of candidates in the beam for efficiency [12], and thus beam diversity tends to decrease for longer utterances [30]. We note that a similar mechanism has been proposed previously by Zapotoczny et al. [21] in the context of lattice generation for attention-based encoder-decoder models, and by Liu et al. [20] in the context of efficiently rescoring lattices with neural LMs.…”
Section: Decoding With Path Merging To Create Lattices
Mentioning confidence: 99%
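To make the merging idea concrete, the sketch below shows one way hypotheses with an equivalent recent label context could be merged during beam search, keeping the displaced path as a lattice arc so the beam slot is freed without losing the alternative. The state key, the log-sum-exp score combination, and the data layout are assumptions made for illustration, not the exact mechanism of the cited works.

```python
import math
from collections import defaultdict

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def merge_beam(hypotheses, beam_size, context=4):
    """Merge beam hypotheses that share the same recent label context.

    hypotheses: list of dicts {"labels": tuple of int, "logp": float}.
    Returns the pruned beam plus the alternative paths recorded as lattice arcs.
    """
    merged = {}
    lattice_arcs = defaultdict(list)

    for hyp in hypotheses:
        # Hypotheses whose last `context` labels match are treated as reaching
        # the same decoder state (an assumption of a limited-context decoder).
        key = hyp["labels"][-context:]
        if key in merged:
            # Fold the probability mass into the kept hypothesis and record the
            # alternative path as a lattice arc instead of a beam entry.
            merged[key]["logp"] = log_add(merged[key]["logp"], hyp["logp"])
            lattice_arcs[key].append(hyp["labels"])
        else:
            merged[key] = dict(hyp)

    # The freed beam slots can now hold more diverse candidates.
    beam = sorted(merged.values(), key=lambda h: h["logp"], reverse=True)[:beam_size]
    return beam, dict(lattice_arcs)
```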
“…These models produce hypotheses in an autoregressive fashion by conditioning the output on all previously predicted labels, thus making fewer conditional independence assumptions than conventional hybrid systems. End-to-end ASR models have been shown to achieve state-of-the-art results [9,10] on popular public benchmarks, as well as on large-scale industrial datasets [11,12].…”
Section: Introduction
Mentioning confidence: 99%
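As a small illustration of that autoregressive conditioning, the greedy decoder below feeds the entire label history back into the model at every step; `model.step` is a hypothetical interface returning per-label log-probabilities, not any specific toolkit's API.

```python
# Minimal sketch of autoregressive decoding: each new label is predicted
# conditioned on the encoded audio and on all previously emitted labels.
# `model.step` is a hypothetical interface assumed for illustration.

def greedy_autoregressive_decode(model, encoded_audio, sos_id, eos_id, max_len=200):
    labels = [sos_id]
    for _ in range(max_len):
        # Conditioning on the full label history is what removes the
        # conditional independence assumptions of conventional hybrid systems.
        log_probs = model.step(encoded_audio, labels)
        next_label = max(range(len(log_probs)), key=log_probs.__getitem__)
        if next_label == eos_id:
            break
        labels.append(next_label)
    return labels[1:]  # drop the start-of-sequence token
```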
“…The streaming state is preserved by transferring hidden recurrent network states from one decoding instance to the next. In [36], an end-to-end streaming model with RNN Transducers [13] is used to jointly model linguistic and acoustic features by using the previous labels along with the audio features. For training, wordpieces [34,39] are used, where words are further segmented into sub-word units.…”
Section: Background and Related Work
Mentioning confidence: 99%
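The state-transfer idea described here can be sketched as follows: audio is consumed chunk by chunk, and the recurrent hidden state returned by one decoding call is handed to the next. `model.decode_chunk` and its signature are assumptions made for illustration.

```python
# Minimal sketch of streaming decoding with carried-over recurrent state,
# assuming a hypothetical `model.decode_chunk` interface.

def streaming_decode(model, audio_chunks, initial_state=None):
    state = initial_state  # hidden recurrent state transferred across calls
    hypothesis = []
    for chunk in audio_chunks:
        # Each call consumes one audio chunk plus the previously emitted labels,
        # and returns newly emitted labels and the updated hidden state.
        new_labels, state = model.decode_chunk(
            chunk, prev_labels=hypothesis, prev_state=state
        )
        hypothesis.extend(new_labels)
    return hypothesis
```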
“…The increasing omnipresence of smartphones, smart speakers, and tablets, coupled with the adoption of voice assistants, has motivated a modern trend to develop Automatic Speech Recognition (ASR) systems which operate fully on local devices [1,2,3]. The promise of on-device ASR includes increased reliability, improved latency, and privacy benefits by alleviating the need to stream audio to servers.…”
Section: Introduction
Mentioning confidence: 99%