An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling

Sainath, Tara N.; He, Yanzhang; Narayanan, Arun; Botros, Rami; Pang, Ruoming; Rybach, David; Allauzen, Cyril; Variani, Ehsan; Qin, James; Le-The, Quoc-Nam; Gruenstein, Alex; Gulati, Anmol; Li, Bo; Peyser, Cal; Chiu, Chung‐Cheng; Caseiro, Diamantino; Guzman, Emmanuel; McGraw, Ian; Yu, Jiahui; Riley, Michael; Rondon, Pat; Liang, Qiao; Sepand, Mavandadi,; Chang, Shuo-Yiin; Strohman, Trevor; Huang, Wen-Chin; Li, Wei; Wu, Yonghui; Yu, Zhang

doi:10.21437/interspeech.2021-206

Cited by 38 publications

(31 citation statements)

References 0 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The data selection pipelines were written in Apache Beam and run on Cloud Dataflow [11]. Our E2E model, described fully in [12], is a 150Mparameter streaming RNNT [13] emitting 4096 lower-case wordpieces. The encoder is a cascaded [14] Conformer [15] and the decoder is a stateless embedding decoder [16].…”

Section: Resultsmentioning

confidence: 99%

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Peyser¹,

Sainath²,

Pang³

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in textonly corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words rare in the acoustic data. Finally, we tackle domain-mismatch via perplexity-based contrastive selection, filtering for examples matched to the target domain. We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection. When shallow-fused with a state-of-the-art, production speech engine, our LM achieves WER reductions of up to 24% relative on rare-word sentences (without changing overall WER) compared to a baseline LM trained on the raw corpus. These gains are further validated through favorable side-by-side evaluations on live voice search traffic.

show abstract

Section: Resultsmentioning

confidence: 99%

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Peyser¹,

Sainath²,

Pang³

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

“…With the balanced size of causal and non-causal encoders, we show that it improves quality and fits deployment constraints better in Section 4.3. Our large-medium model only has around 70% of model size, compared to the previous models in [13,14]. Similarly, the large-medium-small super-net is comprised of a 20M causal encoder for the small sub-model, an additional 26.8M causal encoder for the medium sub-model, and a final 60M noncausal encoder for the large sub-model, as shown in Figure 3.…”

Section: Dynamic Cascaded Encoder Model In Practicementioning

confidence: 99%

“…Recently, we presented an on-device E2E model based on a two-pass cascaded encoder which outperforms a conventional model in terms of word error rate (WER) on both search and long-tail queries, as well as endpointer latency metrics [13]. We further adapted the cascaded encoder to a small 1st-pass (50M parameters) large 2nd-pass (100M parameters) architecture to improve computational latency for both cloud and edge tensor processing units (TPUs), while maintaining quality [14].…”

Section: Introductionmentioning

confidence: 99%

“…To support such diversity of scenarios, we propose an approach by extending the cascaded encoder architecture in [13] to unify multiple size configurations in a single model during training. By only running a subset of the model layers at inference time, the model can be executed as different sizes with similar accuracies as the independently trained models of the corresponding sizes.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

Ding¹,

Wang²,

Zhao³

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios. Moreover, the model can significantly reduce model size and power consumption without loss of quality. Namely, with the dynamic cascaded encoder model, we explore three techniques to maximally boost the performance of each model size: 1) Use separate decoders for each sub-model while sharing the encoders; 2) Use funnelpooling to improve the encoder efficiency; 3) Balance the size of causal and non-causal encoders to improve quality and fit deployment constraints. Overall, the proposed large-medium model has 30% smaller size and reduces power consumption by 33%, compared to the baseline cascaded encoder model. The triple-size model that unifies the large, medium, and small models achieves 37% total size reduction with minimal quality loss, while substantially reducing the engineering efforts of having separate models.

show abstract

“…Our speech recognizer is a streaming Conformer acoustic model [26] via HAT shallow fusion [28], and we measure performance on a ∼10k sample of Google voice search traffic with natively capitalized reference transcripts. When the acoustic model is fixed and the output tokens are cased, we show that improving capitalization normalization of the language model training data leads to reduction of upper-case error rate (UER), the character error rate of each capitalized character in either the predicted or reference transcript (Table 5).…”

Section: Case-aware Language Models In Speech Recognitionmentioning

confidence: 99%

Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

Zhang¹,

Cheng²,

Kumar³

et al. 2022

Preprint

View full text Add to dashboard Cite

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A caseaware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.

show abstract

An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling

Cited by 38 publications

References 0 publications

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

Contact Info

Product

Resources

About