2020
DOI: 10.48550/arxiv.2007.15188
Preprint

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

Cited by 10 publications (14 citation statements)
References 27 publications
“…The model also contains a tuned word reward which penalizes shorter utterances. 2 The data was processed such that the users are not identifiable…”
Section: N-gram Pruning (mentioning; confidence: 99%)
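The tuned word reward quoted above can be sketched as a per-word bonus added to each hypothesis score during decoding, so that longer transcriptions are not unfairly penalized by accumulating log-probabilities. The function name and weight below are illustrative assumptions, not taken from the cited paper.

```python
# Minimal sketch of a word reward in hypothesis scoring (illustrative only):
# each emitted word adds a fixed positive reward to the hypothesis score,
# counteracting the model's bias toward shorter utterances.

def score_hypothesis(log_prob: float, num_words: int, word_reward: float = 0.5) -> float:
    """Total score = model log-probability + a tuned reward per word."""
    return log_prob + word_reward * num_words

# A longer hypothesis with a slightly lower log-probability can now outrank
# a shorter one.
short = score_hypothesis(log_prob=-4.0, num_words=2)  # -4.0 + 1.0 = -3.0
long_ = score_hypothesis(log_prob=-4.6, num_words=4)  # -4.6 + 2.0 ~ -2.6
assert long_ > short
```

The reward value is typically tuned on a development set to balance deletion and insertion errors.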
“…Hybrid Automatic Speech Recognition (ASR) models consist of separately trained models for acoustics, pronunciations and language, [1,2], whereas end-to-end (E2E) models integrate these components into a single network [3,4,5], enabling end-to-end training and optimization. Latest advancements in the ASR technology have popularized E2E models like RNN-T as they provide state-of-the-art performance across a wide variety of streaming applications [4].…”
Section: Introduction (mentioning; confidence: 99%)
“…Representative models include streaming models such as the recurrent neural network transducer (RNN-T) [1], attention-based models [8,2,3], and transformer-based models [9,10,11,12]. Compared to sophisticated conventional models [13,14], E2E models such as RNN-T and Listen, Attend and Spell (LAS) have shown competitive performance [6,5,7,15]. To further improve recognition accuracy, a two-pass LAS rescoring model has been proposed in [16], which uses a nonstreaming LAS decoder to rescore the RNN-T hypotheses.…”
Section: Introduction (mentioning; confidence: 99%)
“…Transformers have also been applied in post-processing for E2E models. [25,15] use transformer for spelling correction. [24] applies transformer decoder in second-pass rescoring.…”
Section: Introduction (mentioning; confidence: 99%)
“…Several second pass rescoring methods have also been studied for improving RNN-T models. In addition to rescoring with regular neural LM [9,10], a Listen, Attend, and Spell (LAS) component [11,12] (which attends to acoustics) and recently a deliberation model [13] (which attends to both acoustics and first-pass hypotheses) have been used. A recent study [14] obtains further gains by modifying MWER loss criteria for the two-pass framework with LAS targeted at improving recognition of proper noun.…”
Section: Introduction (mentioning; confidence: 99%)
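The second-pass rescoring described in the excerpt above can be sketched as re-ranking an n-best list by interpolating first-pass (e.g. RNN-T) scores with a second-pass model's scores (e.g. a neural LM or LAS decoder). The weights and data below are illustrative assumptions, not values from any of the cited papers.

```python
# Minimal sketch of two-pass rescoring (illustrative only): first-pass and
# second-pass log-probabilities for each n-best hypothesis are linearly
# interpolated, and the list is re-ranked by the combined score.

def rescore(nbest, first_weight=0.5, second_weight=0.5):
    """nbest: list of (hypothesis, first_pass_logp, second_pass_logp).

    Returns the list sorted by interpolated score, best first.
    """
    return sorted(
        nbest,
        key=lambda h: first_weight * h[1] + second_weight * h[2],
        reverse=True,
    )

# Hypothetical n-best list: the second pass overturns the first-pass ranking.
nbest = [
    ("recognize speech", -3.0, -6.0),    # combined: -4.5
    ("wreck a nice beach", -2.8, -9.0),  # combined: -5.9
]
assert rescore(nbest)[0][0] == "recognize speech"
```

The interpolation weights are typically tuned on held-out data; LAS- or deliberation-based rescorers additionally condition on the acoustics (and first-pass hypotheses) rather than on text alone.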