2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003906
Improving RNN Transducer Modeling for End-to-End Speech Recognition

Abstract: In the last few years, an emerging trend in automatic speech recognition research is the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the three most popular methods. Among them, RNN-T has the advantage of supporting online streaming, which is challenging for AED, and it does not share CTC's frame-independence assumption. In this paper, we improve RNN-T training in two aspects. First, we optimize the traini…
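The abstract's contrast between CTC and RNN-T rests on how RNN-T scores a label sequence by summing over all monotonic alignments of encoder frames and output labels. As a hypothetical illustration (not the paper's implementation), here is a minimal NumPy sketch of the forward pass over the RNN-T lattice, assuming precomputed per-node blank and label log-probabilities:

```python
import numpy as np

def rnnt_forward_logprob(log_blank, log_label):
    """Forward algorithm over the T x (U+1) RNN-T alignment lattice.

    log_blank[t, u]: log P(blank | frame t, u labels emitted)  -> move right
    log_label[t, u]: log P(y_{u+1} | frame t, u labels emitted) -> move up
    Returns log P(y | x), summed over all monotonic alignments.
    """
    T, U1 = log_blank.shape          # T frames, U1 = U + 1 prediction states
    U = U1 - 1
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0                # start: no frames consumed, no labels emitted
    for t in range(T):
        for u in range(U1):
            if t > 0:                # arrive by emitting blank at (t-1, u)
                alpha[t, u] = np.logaddexp(alpha[t, u],
                                           alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:                # arrive by emitting label y_u at (t, u-1)
                alpha[t, u] = np.logaddexp(alpha[t, u],
                                           alpha[t, u - 1] + log_label[t, u - 1])
    # terminate with a final blank from the last lattice node
    return alpha[T - 1, U] + log_blank[T - 1, U]
```

With a prediction network conditioning `log_label` on the emitted history, this sum is exactly what removes CTC's frame-independence assumption.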

Cited by 155 publications (89 citation statements)
References 40 publications
“…When compared to a conventional baseline hybrid setup [14] that trains an LSTM with the CE and then the MMI criteria, the two-head cltLSTM-12 reduces the WER from 13.01 to 9.34%, which is a 28.2% relative reduction. While we are also working on replacing hybrid models with E2E models [35], the work conducted in this paper indeed presents us a super challenging hybrid model baseline to beat.…”
Section: Two-head cltLSTM
confidence: 99%
“…The input feature is a vector of 80-dimension log Mel filter bank for every 10 milliseconds (ms) of speech. Eight vectors are stacked together to form an input frame to the encoder, and the frame shift is 30 ms. All RNN-T models adopt the configuration recommended in [22,28]. All encoders (Enc.)…”
Section: Methods
confidence: 99%
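The frame-stacking scheme quoted above (eight 10 ms, 80-dim log-Mel vectors stacked into one 640-dim encoder input, advancing three vectors for a 30 ms shift) can be sketched as follows; the function name and defaults are illustrative, not from the paper:

```python
import numpy as np

def stack_frames(feats, stack=8, stride=3):
    """Stack `stack` consecutive feature vectors (10 ms each) into one
    encoder input frame, advancing by `stride` vectors (30 ms shift).

    feats: array of shape (T, D), e.g. D = 80 log-Mel bins.
    Returns array of shape (T', stack * D), e.g. (T', 640).
    """
    T, D = feats.shape
    out = []
    for start in range(0, T - stack + 1, stride):
        # flatten the window of `stack` vectors into one long vector
        out.append(feats[start:start + stack].reshape(-1))
    return np.stack(out)
```

Stacking plus subsampling like this reduces the encoder's sequence length by 3x, which lowers both compute and decoding latency.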
“…RNN-T overcomes the conditional independence assumption of CTC with the prediction network; moreover, it allows streaming ASR because it still performs frame-level monotonic decoding. Hence, there has been a significant research effort in promoting this approach in the ASR community [22,21,25,26,27], and RNN-T has recently been successfully deployed on embedded devices [28].…”
Section: Introduction
confidence: 99%
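The frame-level monotonic decoding this statement refers to is what makes RNN-T streamable: each encoder frame is processed once, emitting labels until the joint network predicts blank. A schematic greedy decoder, with the prediction and joint networks passed in as stubs (this is an assumed sketch, not the paper's decoder):

```python
def rnnt_greedy_decode(encoder_outs, predict, joint, blank=0, max_symbols=10):
    """Frame-synchronous greedy decoding for an RNN-T model.

    encoder_outs: iterable of per-frame encoder outputs (consumable as a stream).
    predict(last_label, state) -> (pred_out, new_state): prediction network stub.
    joint(enc_frame, pred_out) -> list of scores over the vocabulary.
    """
    hyp, state = [], None
    for enc_t in encoder_outs:
        for _ in range(max_symbols):        # cap emissions per frame
            pred_out, new_state = predict(hyp[-1] if hyp else None, state)
            scores = joint(enc_t, pred_out)
            k = max(range(len(scores)), key=scores.__getitem__)
            if k == blank:
                break                       # blank: advance to the next frame
            hyp.append(k)                   # non-blank: emit and stay on frame
            state = new_state
    return hyp
```

Because the loop never looks ahead past the current encoder frame, latency is bounded by the encoder's own context, which is what enables on-device streaming.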
“…Since 2015, there has been a significant trend in the field moving from hybrid HMM/NN systems to end-to-end (E2E) NN modeling [4], [6], [18]-[24] for ASR. E2E systems are characterized by the use of a single model transforming the input acoustic feature stream to a target stream of output tokens, which might be constructed of characters, subwords, or even words.…”
Section: B. End-to-End Systems
confidence: 99%