Interspeech 2021
DOI: 10.21437/interspeech.2021-206
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling

Cited by 38 publications (31 citation statements). References: 0 publications.
“…The data selection pipelines were written in Apache Beam and run on Cloud Dataflow [11]. Our E2E model, described fully in [12], is a 150M-parameter streaming RNNT [13] emitting 4096 lower-case wordpieces. The encoder is a cascaded [14] Conformer [15] and the decoder is a stateless embedding decoder [16].…”
Section: Results (mentioning, confidence: 99%)
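As a concrete illustration of the Beam-on-Dataflow setup mentioned in that statement, the sketch below shows a minimal data selection pipeline. The input path and the keep_utterance predicate are hypothetical placeholders, not the authors' actual selection criteria.

# Minimal sketch of a data selection pipeline in Apache Beam; running it
# on Cloud Dataflow only requires passing --runner=DataflowRunner. The
# path and the predicate below are illustrative placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def keep_utterance(line: str) -> bool:
    """Hypothetical selection predicate: keep non-empty transcripts."""
    return bool(line.strip())

def run() -> None:
    opts = PipelineOptions()  # add --runner=DataflowRunner for Dataflow
    with beam.Pipeline(options=opts) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://bucket/transcripts.txt")
            | "Select" >> beam.Filter(keep_utterance)
            | "Write" >> beam.io.WriteToText("gs://bucket/selected")
        )

if __name__ == "__main__":
    run()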
“…With the balanced sizes of the causal and non-causal encoders, we show in Section 4.3 that it improves quality and better fits deployment constraints. Our large-medium model has only around 70% of the model size of the previous models in [13, 14]. Similarly, the large-medium-small super-net comprises a 20M causal encoder for the small sub-model, an additional 26.8M causal encoder for the medium sub-model, and a final 60M non-causal encoder for the large sub-model, as shown in Figure 3.…”
Section: Dynamic Cascaded Encoder Model in Practice (mentioning, confidence: 99%)
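Read literally, those sub-model sizes are additive, because each larger sub-model reuses the smaller one's causal stack. A quick back-of-the-envelope check (variable names are illustrative, not from the paper) also recovers the "around 70%" comparison against the 150M-parameter model quoted earlier.

# Worked check of the quoted super-net sizes: sub-models share the
# causal stack, so each larger sub-model adds parameters on top of the
# smaller one. Variable names are illustrative, not from the paper.
SMALL_CAUSAL_M = 20.0         # causal encoder shared by all sub-models
MEDIUM_EXTRA_CAUSAL_M = 26.8  # extra causal layers for the medium model
LARGE_NONCAUSAL_M = 60.0      # non-causal encoder for the large model
PREVIOUS_MODEL_M = 150.0      # prior cascaded-encoder model size [12]

small = SMALL_CAUSAL_M
medium = small + MEDIUM_EXTRA_CAUSAL_M
large = medium + LARGE_NONCAUSAL_M
print(f"small={small:.1f}M medium={medium:.1f}M large={large:.1f}M")
print(f"large vs. previous model: {large / PREVIOUS_MODEL_M:.0%}")  # ~71%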
“…Recently, we presented an on-device E2E model based on a two-pass cascaded encoder which outperforms a conventional model in terms of word error rate (WER) on both search and long-tail queries, as well as on endpointer latency metrics [13]. We further adapted the cascaded encoder to a small 1st-pass (50M parameters), large 2nd-pass (100M parameters) architecture to improve computational latency for both cloud and edge tensor processing units (TPUs), while maintaining quality [14].…”
Section: Introduction (mentioning, confidence: 99%)
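The two-pass flow described there can be pictured as a streaming causal first pass that emits partial results, followed by a non-causal second pass that finalizes them with right context. The sketch below is purely conceptual; every class and method name is a hypothetical placeholder, not the paper's API.

# Conceptual sketch of two-pass cascaded-encoder decoding. All names
# here are hypothetical placeholders, not the paper's actual API.
from typing import Iterable, Iterator, List

def recognize(frames: Iterable[list],
              causal_encoder,     # small streaming 1st pass (~50M params)
              noncausal_encoder,  # larger 2nd pass (~100M params)
              decoder) -> Iterator[str]:
    causal_feats: List = []
    for frame in frames:
        h = causal_encoder.step(frame)   # frame-synchronous, causal
        causal_feats.append(h)
        yield decoder.partial(h)         # low-latency partial hypothesis
    refined = noncausal_encoder(causal_feats)  # sees right context
    yield decoder.finalize(refined)      # final, higher-quality result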
“…Our speech recognizer is a streaming Conformer acoustic model [26] combined with a language model via HAT shallow fusion [28], and we measure performance on a ∼10k sample of Google voice search traffic with natively capitalized reference transcripts. When the acoustic model is fixed and the output tokens are cased, we show that improving capitalization normalization of the language model training data reduces the upper-case error rate (UER): the character error rate over the capitalized characters in either the predicted or reference transcript (Table 5).…”
Section: Case-aware Language Models In Speech Recognition (mentioning, confidence: 99%)
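Under one reading of that definition, UER counts character-level edit operations that touch an upper-case character on either side, normalized by the number of upper-case reference characters. The sketch below implements that interpretation; it is illustrative, not the paper's reference implementation.

# Sketch of upper-case error rate (UER) under one interpretation of the
# definition above: edit operations involving an upper-case character,
# divided by the number of upper-case reference characters.
def uer(ref: str, hyp: str) -> float:
    n, m = len(ref), len(hyp)
    # Character-level Levenshtein DP table.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    # Backtrace, counting errors that involve an upper-case character.
    i, j, errors = n, m, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1]
                and ref[i - 1] == hyp[j - 1]):
            i, j = i - 1, j - 1                    # match, no error
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            if ref[i - 1].isupper() or hyp[j - 1].isupper():
                errors += 1                        # substitution
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            if ref[i - 1].isupper():
                errors += 1                        # deletion
            i -= 1
        else:
            if hyp[j - 1].isupper():
                errors += 1                        # insertion
            j -= 1
    upper_refs = sum(c.isupper() for c in ref)
    return errors / upper_refs if upper_refs else 0.0

print(uer("Call John in Paris", "call john in Paris"))  # 2 of 3 -> ~0.667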