Interspeech 2019
DOI: 10.21437/interspeech.2019-1341

Two-Pass End-to-End Speech Recognition

Abstract: The requirements for many applications of state-of-the-art speech recognition systems include not only a low word error rate (WER) but also low latency. Specifically, for many use cases, the system must be able to decode utterances in a streaming fashion and faster than real time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models…
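To make the two-pass idea in the abstract concrete, the following is a minimal sketch of the decoding flow it implies: a streaming first pass produces an n-best list with low latency, and a second pass re-ranks that list once the utterance ends. All function names, scores, and the interpolation weight below are hypothetical placeholders for illustration, not the paper's implementation.

    # Hedged sketch of two-pass decoding: streaming first pass, then n-best rescoring.
    # Every name and number here is a placeholder, not the paper's actual system.
    from typing import List, Tuple

    def stream_first_pass(frames: List[list]) -> List[Tuple[str, float]]:
        """Streaming first pass (RNN-T-like): consume audio frames as they arrive
        and return an n-best list of (hypothesis, first_pass_score)."""
        nbest = [("play some jazz", -1.2), ("play sun jazz", -2.9)]  # toy n-best
        for _frame in frames:
            pass  # a real system would update partial hypotheses per frame here
        return nbest

    def rescore_second_pass(nbest: List[Tuple[str, float]], weight: float = 0.5) -> str:
        """Second pass: re-rank the first-pass n-best with a stronger,
        non-streaming score, trading a little latency for accuracy."""
        def second_pass_score(hyp: str) -> float:
            return -0.1 * len(hyp.split())  # placeholder for e.g. an attention-decoder score
        rescored = [
            (hyp, weight * s1 + (1 - weight) * second_pass_score(hyp))
            for hyp, s1 in nbest
        ]
        return max(rescored, key=lambda item: item[1])[0]

    if __name__ == "__main__":
        audio_frames = [[0.0] * 80 for _ in range(100)]  # stand-in for log-mel frames
        nbest = stream_first_pass(audio_frames)          # low-latency streaming result
        print(rescore_second_pass(nbest))                # final, rescored transcript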

Cited by 111 publications (91 citation statements)
References 26 publications
“…The proposed 2-pass E2E architecture [10] is shown in Figure 1. Let us denote input acoustic frames as x = (x1 .…”
Section: Model Architecture
confidence: 99%
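Read together with the abstract, this notation can be expanded into a short sketch of the two-pass pipeline. Everything beyond x below (the encoder output e, the hypothesis symbols, the second-pass score) is an assumption added for illustration, not quoted from the citing paper:

    % Hedged sketch of the notation; symbols other than x are assumptions.
    \begin{align*}
      \mathbf{x}     &= (x_1, \ldots, x_T)                   && \text{input acoustic frames}\\
      \mathbf{e}     &= \mathrm{Encoder}(\mathbf{x})         && \text{shared encoder output}\\
      \mathbf{y}^{R} &= \mathrm{FirstPassDecode}(\mathbf{e}) && \text{streaming first-pass n-best}\\
      \mathbf{y}^{*} &= \operatorname*{arg\,max}_{\mathbf{y} \in \mathrm{NBest}(\mathbf{y}^{R})}
                        \mathrm{Score}_{\mathrm{2nd}}(\mathbf{y} \mid \mathbf{e})
                                                             && \text{second-pass rescoring}
    \end{align*}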
“…Our E2E model is trained on audio-text pairs only, which is a small fraction of data compared to the trillion-word text-only data a conventional LM is trained with. Previous work [2,10] used only search utterances. To increase vocabulary and diversity of training data, we explore using more data by incorporating multi-domain utterances as described in [1].…”
Section: Multi-domain Data
confidence: 99%