2022
DOI: 10.48550/arxiv.2204.10586
Preprint

Efficient Training of Neural Transducer for Speech Recognition

Abstract: As one of the most popular sequence-to-sequence modeling approaches for speech recognition, the RNN-Transducer has achieved evolving performance with more and more sophisticated neural network models of growing size and increasing training epochs. While strong computation resources seem to be the prerequisite of training superior models, we try to overcome it by carefully designing a more efficient training pipeline. In this work, we propose an efficient 3-stage progressive training pipeline to build highly-performing…

Cited by 2 publications (5 citation statements); References 34 publications.

“…As shown in Figure 10, on Librispeech, the current best-published classical hybrid systems range around 2.3% (test-clean) and 4.9% (test-other) word error rate [323], [222], while there already are a number of E2E systems providing similar performance [224], [205], [320], [206], with some E2E systems clearly outperforming former state-of-the-art results with word error rates down to 1.8% (test-clean) and 3.7% (test-other) [324], with similar results reported in [45], [97]. Merging insights from classical HMM-based modeling and monotonic RNN-T provided similarly good results with a limited training budget [124]. Finally, when trained on Switchboard 300h, the current best result, obtained with an E2E system [180], is 5.4%, compared to 6.6% word error rate for the best hybrid system result [325] on the HUB5'00 Switchboard test set (see Figure 9).…”
Section: Relationship to Classical ASR (supporting)
confidence: 65%
“…Dedicated training schedules have been developed to guide the optimization process and, as part of that, to reach proper alignment behavior explicitly or implicitly [52], [124], [147]. Many approaches exist for learning rate control [161], [162]: NewBob [163], [164] and enhancements [162]; global versus parameter-wise learning rate control (exponential decay, power decay, etc.)…”
Section: E. Training Schedules and Curricula (mentioning)
confidence: 99%
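
The schedule families named in this statement are easy to make concrete. The following minimal Python sketch implements exponential decay, power decay, and a NewBob-style rule that shrinks the rate once the relative improvement on a held-out set stalls; all constants (decay factors, thresholds) are illustrative assumptions, not values from the cited works.

def exponential_decay(base_lr, step, decay=0.9, every=1000):
    # lr(t) = base_lr * decay^(t / every): smooth geometric decay.
    return base_lr * decay ** (step / every)

def power_decay(base_lr, step, power=0.5, offset=1000):
    # lr(t) = base_lr * ((offset + t) / offset)^(-power): slower polynomial decay.
    return base_lr * ((offset + step) / offset) ** -power

def newbob_factor(prev_dev_loss, dev_loss, shrink=0.5, threshold=0.01):
    # NewBob-style control: scale the rate by `shrink` once the relative
    # improvement on a cross-validation set falls below `threshold`.
    rel_improvement = (prev_dev_loss - dev_loss) / prev_dev_loss
    return shrink if rel_improvement < threshold else 1.0

# Example: compare the two global schedules at a few training steps.
for step in (0, 1000, 10000):
    print(step, exponential_decay(1e-3, step), power_decay(1e-3, step))
print("NewBob factor:", newbob_factor(prev_dev_loss=2.00, dev_loss=1.99))

A parameter-wise variant, as contrasted with global control in the quote, would simply maintain one such schedule (or one NewBob state) per parameter group instead of a single global rate.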
“…Merging insights from classical HMM-based modeling and monotonic RNN-T provided similarly good results with a limited training budget [94]. Finally, when trained on Switchboard 300h, the current best result, obtained with an E2E system [151], is 5.4%, compared to 6.6% word error rate for the best hybrid system result [306] on the HUB5'00 Switchboard test set.…”
Section: Use of Large-Scale Pretrained LMs (mentioning)
confidence: 59%
“…(4) by a single term corresponding to the generated alignment sequence. Recently, a similar procedure was introduced in [94], also employing only E2E models. In this work, CTC is used to initialize the training and to generate an initial alignment, followed by intermediate Viterbi-style RNN-T training and final full-sum fine-tuning.…”
Section: A. Alignment in Training (mentioning)
confidence: 99%
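
To illustrate the full-sum versus Viterbi-style distinction this statement relies on, the following self-contained numpy sketch computes the transducer forward score over a toy T x (U+1) lattice: with logaddexp it sums over all alignment paths (full-sum), and with max it scores only the single best path. The shapes and probabilities are invented for illustration; note that the cited pipeline trains against a fixed CTC-generated alignment rather than the running best path.

import numpy as np

def transducer_forward(log_blank, log_label, fullsum=True):
    # log_blank: (T, U+1) log-probs of emitting blank at lattice node (t, u).
    # log_label: (T, U)   log-probs of emitting label u+1 at node (t, u).
    # Returns the total log-probability of the label sequence (fullsum=True)
    # or the score of the single best alignment path (fullsum=False).
    T, U1 = log_blank.shape
    merge = np.logaddexp if fullsum else np.maximum
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t > 0:  # arrive by emitting blank (advance one time frame)
                alpha[t, u] = merge(alpha[t, u], alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:  # arrive by emitting the next label (stay in frame t)
                alpha[t, u] = merge(alpha[t, u], alpha[t, u - 1] + log_label[t, u - 1])
    return alpha[T - 1, U1 - 1] + log_blank[T - 1, U1 - 1]

rng = np.random.default_rng(0)
T, U = 5, 3  # toy sizes: 5 time frames, 3 output labels
log_blank = np.log(rng.uniform(0.1, 1.0, size=(T, U + 1)))
log_label = np.log(rng.uniform(0.1, 1.0, size=(T, U)))
print("full-sum score:", transducer_forward(log_blank, log_label, fullsum=True))
print("best-path score:", transducer_forward(log_blank, log_label, fullsum=False))

The full-sum score is always at least the best-path score, since it aggregates every path. Framewise training against a fixed alignment, as in the intermediate Viterbi-style stage quoted above, replaces this lattice recursion with a plain per-frame cross-entropy, which is what makes that stage cheap relative to full-sum fine-tuning.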