2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru.2017.8268935
Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer

Abstract: We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an 'encoder', which is initialized from a connectionist temporal classification…
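The abstract names the components of the standard RNN-T formulation: an acoustic encoder (here initialized from a CTC model), a prediction network over previous labels, and a joint network that combines the two. Below is a minimal sketch of that standard formulation in PyTorch; all dimensions, layer counts, and names are illustrative assumptions, not this paper's configuration.

```python
# Minimal RNN-T sketch in PyTorch (illustrative only; dimensions and layer
# counts are assumptions, not the paper's setup).
import torch
import torch.nn as nn

class RNNTransducer(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=4096, hidden=640, blank=0):
        super().__init__()
        self.blank = blank
        # Acoustic encoder: in the paper this part is initialized from CTC.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # Prediction network: an RNN language model over previous labels.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        # Joint network: combines the two streams per (frame, label) pair.
        self.joint = nn.Sequential(nn.Tanh(), nn.Linear(hidden, vocab_size))

    def forward(self, feats, labels):
        # feats: (B, T, feat_dim); labels: (B, U) previous non-blank labels.
        enc, _ = self.encoder(feats)                  # (B, T, H)
        pred, _ = self.predictor(self.embed(labels))  # (B, U, H)
        # Broadcast-add to form the (B, T, U, H) joint lattice.
        lattice = enc.unsqueeze(2) + pred.unsqueeze(1)
        return self.joint(lattice)                    # (B, T, U, vocab) logits

model = RNNTransducer()
logits = model(torch.randn(2, 100, 80), torch.randint(1, 4096, (2, 10)))
print(logits.shape)  # torch.Size([2, 100, 10, 4096])
```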

Cited by 305 publications (248 citation statements)
References 16 publications
“…For streaming speech recognition models, recurrent neural networks (RNNs) have been the de facto choice since they can model the temporal dependencies in the audio features effectively [13] while maintaining a constant computational requirement for each frame. Streamable end-to-end modeling architectures such as the Recurrent Neural Network Transducer (RNN-T) [14,15,16], Recurrent Neural Aligner (RNA) [17], and Neural Transducer [18] utilize an encoder-decoder based framework where both encoder and decoder are layers of RNNs that generate features from audio and labels respectively. In particular, the RNN-T and RNA models are trained to learn alignments between the acoustic encoder features and the label encoder features, and so lend themselves naturally to frame-synchronous decoding.…”
Section: Introduction
confidence: 99%
“…End-to-end (E2E) models [2,3,4,5,6,7,8,9] have gained large popularity in the automatic speech recognition (ASR) community over the last few years. These models replace the components of a conventional ASR system, namely the acoustic model (AM), pronunciation model (PM) and language model (LM), with a single neural network.…”
Section: Introduction
confidence: 99%
“…We also evaluated WERs for various block sizes (L_block), i.e., 4, 8, 16, and 32, for the naive block Transformer and contextual block Transformer on the WSJ dataset. The block processing was carried out in the half-overlapping manner; thus, L_hop = L_block / 2.…”
Section: Comparison of Block Size
confidence: 99%
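The half-overlapping block processing in this excerpt (L_hop = L_block / 2) can be made concrete with a short sketch. The block sizes follow the values the excerpt tests (4, 8, 16, 32); the frame sequence and function name are stand-ins.

```python
# Half-overlapping block segmentation sketch: L_hop = L_block // 2.
def split_into_blocks(frames, l_block):
    l_hop = l_block // 2  # half-overlap, as in the excerpt
    return [frames[i:i + l_block] for i in range(0, len(frames) - l_hop, l_hop)]

frames = list(range(12))  # 12 dummy frame indices
for l_block in (4, 8):
    print(l_block, split_into_blocks(frames, l_block))
# 4 -> [[0,1,2,3], [2,3,4,5], [4,5,6,7], [6,7,8,9], [8,9,10,11]]
# 8 -> [[0,...,7], [4,...,11]]
```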