Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus

Peyser, Cal; Mavandadi, Sepand; Sainath, Tara N.; Apfel, James; Pang, Ruoming; Kumar, Shankar

doi:10.21437/interspeech.2020-1465

Cited by 29 publications

(10 citation statements)

References 20 publications

“…Tuning the LM weights on multiple development sets is computationally expensive and time-consuming. To eliminate the weights tuning, the MWER training with LM fusion was proposed in [175,176] where the LM fusion is performed during MWER training. During inference, LM weights pre-set in training enables a robust LM fusion on test sets from different domains.…”

Section: B) Domain Adaptation (mentioning)

confidence: 99%

Recent Advances in End-to-End Automatic Speech Recognition

Li

2021

Preprint

Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve the state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, being optimized for production for decades, are usually good at these factors. Without providing excellent solutions to all these factors, it is hard for E2E models to be widely commercialized. In this paper, we will overview the recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective.

“…To improve Shallow Fusion, a Density Ratio method [12,13] and an internal LM estimation-based Fusion [14,15] were proposed in which a source-domain LM score and an internal LM score are subtracted from the Shallow Fusion score, respectively, during inference. Minimum word error rate training with LM fusion [16,17,18,19] was conducted to obviate the need for LM weights tuning. Furthermore, structural LM fusion methods such as Deep Fusion [11], Cold Fusion [20,21] and Simple Fusion [22] jointly train an E2E model with an external LM to learn a sophisticated combination between the two models.…”

Section: Introduction (mentioning)

confidence: 99%
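
The fusion scores described in the statement above can be written out as follows. This is a sketch under assumed notation: the symbols $\lambda_E$, $\lambda_S$, $\lambda_I$ and the parameter names $\theta$ are illustrative and do not come from the cited papers.

```latex
% X: input audio, Y: hypothesis token sequence (assumed notation)
\begin{align}
  \text{Shallow Fusion:} \quad
    S_{\mathrm{SF}}(Y) &= \log P(Y \mid X; \theta_{\mathrm{E2E}})
      + \lambda_E \log P(Y; \theta_{\mathrm{ext}}) \\
  \text{Density Ratio:} \quad
    S_{\mathrm{DR}}(Y) &= S_{\mathrm{SF}}(Y)
      - \lambda_S \log P(Y; \theta_{\mathrm{src}}) \\
  \text{ILME-based Fusion:} \quad
    S_{\mathrm{ILME}}(Y) &= S_{\mathrm{SF}}(Y)
      - \lambda_I \log P(Y; \theta_{\mathrm{ILM}})
\end{align}
```

Here $\theta_{\mathrm{ext}}$ is the external LM, $\theta_{\mathrm{src}}$ a separately trained source-domain LM, and $\theta_{\mathrm{ILM}}$ the internal LM estimated from the E2E model itself.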

“…Other solutions include synthesizing speech from the text-only adaptation data using a text-to-speech (TTS) model, and then finetuning the E2E model with the synthesized audio-transcript pairs [23,19,24]. During inference, only the adapted E2E model is needed.…”

Section: Introduction (mentioning)

confidence: 99%

Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition

Meng, Gaur, Kanda et al. 2021

Preprint

Text-only adaptation of an end-to-end (E2E) model remains a challenging task for automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, significantly increasing the computation cost. To overcome this, we propose an internal LM adaptation (ILMA) of the E2E model using text-only data. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the token sequence probability which is approximated by the E2E model output after zeroing out the encoder contribution. During ILMA, we fine-tune the internal LM, i.e., the E2E components excluding the encoder, to minimize a cross-entropy loss. To make ILMA effective, it is essential to train the E2E model with an internal LM loss besides the standard E2E loss. Furthermore, we propose to regularize ILMA by minimizing the Kullback-Leibler divergence between the output distributions of the adapted and unadapted internal LMs. ILMA is the most effective when we update only the last linear layer of the joint network. ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost. Experimented with 30K-hour trained transformer transducer models, ILMA achieves up to 34.9% relative word error rate reduction from the unadapted baseline.

“…However, all these methods require audio as the adaptation data when applied to E2E models [35,36,37]. A promising solution is to integrate an external language model (LM) into the E2E model during inference or during MWER training [38,16,39]. With no clear separation of acoustic and language models in an E2E model, LM fusion remains to be a challenging task.…”

Section: Introduction (mentioning)

confidence: 99%

“…We first apply Shallow Fusion to generate the N-best hypotheses and compute their posteriors for MWER training (i.e., MWER-SF). Note that our MWER-SF differs from [38] in that the E2E and LM scores are interpolated in the log domain and the combined scores are re-normalized over N-best hypotheses. Further, we propose a MWER training with ILME (MWER-ILME) in which the N-best hypotheses are generated by an ILME-based Fusion and their posteriors are computed by the probabilities of the E2E model, internal LM and external LM.…”

Section: Introduction (mentioning)

confidence: 99%
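
The log-domain interpolation and N-best re-normalization mentioned in the statement above might look roughly like the following. This is a minimal sketch, not the cited paper's implementation: the function names, the `lm_weight` parameter, and the mean-subtracted expected-error form of the MWER objective are all assumptions made for illustration.

```python
import math

def fused_nbest_posteriors(e2e_logps, lm_logps, lm_weight):
    """Interpolate E2E and LM log-scores, then re-normalize over the N-best list.

    e2e_logps, lm_logps: per-hypothesis log-probabilities (hypothetical inputs).
    lm_weight: LM interpolation weight (hypothetical name).
    """
    combined = [a + lm_weight * b for a, b in zip(e2e_logps, lm_logps)]
    # Softmax over the N-best list; subtract the max for numerical stability.
    top = max(combined)
    exps = [math.exp(c - top) for c in combined]
    total = sum(exps)
    return [e / total for e in exps]

def expected_relative_word_errors(posteriors, word_errors):
    """Posterior-weighted word errors, relative to the N-best average (assumed form)."""
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (w - mean_err) for p, w in zip(posteriors, word_errors))

# Two hypotheses with equal E2E scores; the LM prefers the first one.
posteriors = fused_nbest_posteriors([-1.0, -1.0], [-2.0, -5.0], lm_weight=0.5)
loss = expected_relative_word_errors(posteriors, word_errors=[1, 3])
```

Because the weight is applied inside the posterior computation, it is fixed at training time, which is consistent with the statements' claim that no LM weight tuning is needed at inference.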

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

Meng, Kanda et al. 2021

Preprint

Integrating external language models (LMs) into end-to-end (E2E) models remains a challenging task for domain-adaptive speech recognition. Recently, internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction from Shallow Fusion by subtracting a weighted internal LM score from an interpolation of E2E model and external LM scores during beam search. However, on different test sets, the optimal LM interpolation weights vary over a wide range and have to be tuned extensively on well-matched validation sets. In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weights tuning during inference. Besides MWER training with Shallow Fusion (MWER-SF), we propose a novel MWER training with ILME (MWER-ILME) where the ILME-based fusion is conducted to generate N-best hypotheses and their posteriors. Additional gradient is induced when internal LM is engaged in MWER-ILME loss computation. During inference, LM weights pre-determined in MWER training enable robust LM integrations on test sets from different domains. Experimented with 30K-hour trained transformer transducers, MWER-ILME achieves on average 8.8% and 5.8% relative WER reductions from MWER and MWER-SF training, respectively, on 6 different test sets.
