2020
DOI: 10.48550/arxiv.2007.13802
Preprint
Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition

Cited by 6 publications (14 citation statements) | References 14 publications
“…All models are trained using the Adam optimizer [23], with a learning rate schedule including an initial linear warm-up phase, a constant phase, and an exponential decay phase [4]. All the baseline models and proposed methods use the same training strategy.…”
Section: Methods
Confidence: 99%
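The three-phase schedule described in the quotation (linear warm-up, constant hold, exponential decay) can be sketched as a simple step-to-rate function. All hyper-parameter values below (`peak_lr`, `warmup_steps`, `hold_steps`, `decay_rate`) are hypothetical placeholders, not values reported by the cited papers:

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=1000,
                  hold_steps=2000, decay_rate=0.9999):
    """Return the learning rate at a given training step.

    Phases: linear warm-up -> constant hold -> exponential decay.
    Hyper-parameter values here are illustrative only.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 toward peak_lr.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        # Constant phase at the peak rate.
        return peak_lr
    # Exponential decay after the hold phase ends.
    return peak_lr * decay_rate ** (step - warmup_steps - hold_steps)
```

In practice such a function is passed to the optimizer as a per-step scheduler; the warm-up phase stabilizes early Adam updates, while the decay phase anneals the rate for convergence.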
See 1 more Smart Citation
“…All models are trained using the Adam optimizer [23], with a learning rate schedule including an initial linear warm-up phase, a constant phase, and an exponential decay phase [4]. All the baseline models and proposed methods use the same training strategy.…”
Section: Methodsmentioning
confidence: 99%
“…Recent application of recurrent neural network transducers (RNN-T) has achieved significant progress in the area of online streaming end-to-end automatic speech recognition (ASR) [1][2][3][4]. However, building an accent-robust system remains a big challenge.…”
Section: Introduction
Confidence: 99%
“…We leave the effect of optimizing interpolation weights for best overall perplexity of OOD data as future work. [17,18]. For shallow fusion with a WFST, we use the lookahead approach described in [19] as it avoids unnecessary arc expansion and provides a heuristic approach to perform subword-level rescoring without the need to build the boosting FST directly at the subword level.…”
Section: N-gram Pruning
Confidence: 99%
“…E2E models are commonly trained to maximize the log posteriors of token sequences given speech sequences while the ASR performance is measured by the word error rate (WER). Therefore, a minimum WER (MWER) criterion was proposed to train CTC [14], AED [15], RNN-T [16,17] and hybrid autoregressive transducer (HAT) [18] models, leading to improved ASR performance.…”
Section: Introduction
Confidence: 99%
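The MWER criterion mentioned in the quotation minimizes the expected number of word errors over an N-best list of hypotheses, with hypothesis probabilities renormalized over that list. A minimal sketch of that expected-error computation follows; the function name and inputs are illustrative, not the authors' implementation (which operates on model logits and typically subtracts the mean error from each hypothesis's error as a variance-reducing baseline in the gradient):

```python
import math

def expected_wer_loss(hyp_log_scores, hyp_word_errors):
    """Expected word-error count over an N-best list (illustrative).

    hyp_log_scores : unnormalized log-probabilities of each hypothesis.
    hyp_word_errors: word-level edit distance of each hypothesis
                     to the reference transcript.
    """
    # Renormalize scores over the N-best list (stable softmax).
    m = max(hyp_log_scores)
    exps = [math.exp(s - m) for s in hyp_log_scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Expected number of word errors under the renormalized distribution.
    return sum(p * err for p, err in zip(probs, hyp_word_errors))
```

For example, two equally scored hypotheses with 2 and 0 word errors yield an expected error of 1.0; training pushes probability mass toward low-error hypotheses.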