Interspeech 2019
DOI: 10.21437/interspeech.2019-1341

Two-Pass End-to-End Speech Recognition

Abstract: The requirements for many applications of state-of-the-art speech recognition systems include not only a low word error rate (WER) but also low latency. Specifically, for many use cases, the system must be able to decode utterances in a streaming fashion and faster than real time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models…
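To make the two-pass idea in the abstract concrete, the following is a minimal sketch of the decoding flow it implies: a streaming first pass produces an n-best list with low latency, and a second pass re-ranks that list once the utterance ends. All function names, scores, and the interpolation weight below are hypothetical placeholders for illustration, not the paper's implementation.

    # Hedged sketch of two-pass decoding: streaming first pass, then n-best rescoring.
    # Every name and number here is a placeholder, not the paper's actual system.
    from typing import List, Tuple

    def stream_first_pass(frames: List[list]) -> List[Tuple[str, float]]:
        """Streaming first pass (RNN-T-like): consume audio frames as they arrive
        and return an n-best list of (hypothesis, first_pass_score)."""
        nbest = [("play some jazz", -1.2), ("play sun jazz", -2.9)]  # toy n-best
        for _frame in frames:
            pass  # a real system would update partial hypotheses per frame here
        return nbest

    def rescore_second_pass(nbest: List[Tuple[str, float]], weight: float = 0.5) -> str:
        """Second pass: re-rank the first-pass n-best with a stronger,
        non-streaming score, trading a little latency for accuracy."""
        def second_pass_score(hyp: str) -> float:
            return -0.1 * len(hyp.split())  # placeholder for e.g. an attention-decoder score
        rescored = [
            (hyp, weight * s1 + (1 - weight) * second_pass_score(hyp))
            for hyp, s1 in nbest
        ]
        return max(rescored, key=lambda item: item[1])[0]

    if __name__ == "__main__":
        audio_frames = [[0.0] * 80 for _ in range(100)]  # stand-in for log-mel frames
        nbest = stream_first_pass(audio_frames)          # low-latency streaming result
        print(rescore_second_pass(nbest))                # final, rescored transcript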

Cited by 111 publications (91 citation statements)
References 26 publications
“…The proposed 2-pass E2E architecture [10] is shown in Figure 1. Let us denote input acoustic frames as x = (x1 .…”
Section: Model Architecture
confidence: 99%
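Read together with the abstract, this notation can be expanded into a short sketch of the two-pass pipeline. Everything beyond x below (the encoder output e, the hypothesis symbols, the second-pass score) is an assumption added for illustration, not quoted from the citing paper:

    % Hedged sketch of the notation; symbols other than x are assumptions.
    \begin{align*}
      \mathbf{x}     &= (x_1, \ldots, x_T)                   && \text{input acoustic frames}\\
      \mathbf{e}     &= \mathrm{Encoder}(\mathbf{x})         && \text{shared encoder output}\\
      \mathbf{y}^{R} &= \mathrm{FirstPassDecode}(\mathbf{e}) && \text{streaming first-pass n-best}\\
      \mathbf{y}^{*} &= \operatorname*{arg\,max}_{\mathbf{y} \in \mathrm{NBest}(\mathbf{y}^{R})}
                        \mathrm{Score}_{\mathrm{2nd}}(\mathbf{y} \mid \mathbf{e})
                                                             && \text{second-pass rescoring}
    \end{align*}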
“…Our E2E model is trained on audio-text pairs only, which is a small fraction of data compared to the trillion-word text-only data a conventional LM is trained with. Previous work [2,10] used only search utterances. To increase vocabulary and diversity of training data, we explore using more data by incorporating multi-domain utterances as described in [1].…”
Section: Multi-domain Data
confidence: 99%