Interspeech 2020
DOI: 10.21437/interspeech.2020-1465
Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus

Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems. While this simplifies the model architecture, it complicates the task of incorporating text-only data into training, which is important to the recognition of tail words that do not occur often in audio-text pairs. While shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time, it has not yet been expl…

Cited by 29 publications (10 citation statements). References 20 publications.
“…Tuning the LM weights on multiple development sets is computationally expensive and time-consuming. To eliminate this weight tuning, MWER training with LM fusion was proposed in [175,176], where LM fusion is performed during MWER training. During inference, the LM weights pre-set during training enable robust LM fusion on test sets from different domains.…”
Section: B) Domain Adaptation
confidence: 99%
“…To improve Shallow Fusion, a Density Ratio method [12,13] and an internal LM estimation (ILME)-based Fusion [14,15] were proposed, in which a source-domain LM score and an internal LM score are subtracted from the Shallow Fusion score, respectively, during inference. Minimum word error rate (MWER) training with LM fusion [16,17,18,19] was conducted to obviate the need for LM weight tuning. Furthermore, structural LM fusion methods such as Deep Fusion [11], Cold Fusion [20,21] and Simple Fusion [22] jointly train an E2E model with an external LM to learn a sophisticated combination of the two models.…”
Section: Introduction
confidence: 99%
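The fusion variants contrasted in the statement above differ only in which log scores are combined per hypothesis. A minimal sketch of that score arithmetic follows; the function name, interpolation weights, and the example log-probabilities are illustrative assumptions, not values from the cited papers.

```python
def fused_score(log_p_e2e, log_p_lm, lam=0.3,
                log_p_src_lm=None, src_lam=0.3,
                log_p_ilm=None, ilm_lam=0.3):
    """Combine per-hypothesis log scores at inference time.

    Shallow Fusion adds a weighted external-LM score; Density Ratio
    additionally subtracts a source-domain LM score; ILME-based Fusion
    instead subtracts an estimated internal-LM score.
    All weights here are hypothetical defaults.
    """
    score = log_p_e2e + lam * log_p_lm   # Shallow Fusion
    if log_p_src_lm is not None:         # Density Ratio correction
        score -= src_lam * log_p_src_lm
    if log_p_ilm is not None:            # ILME correction
        score -= ilm_lam * log_p_ilm
    return score

# Made-up log-probabilities for a single hypothesis:
sf = fused_score(-4.0, -6.0)                     # -4.0 + 0.3*(-6.0) = -5.8
dr = fused_score(-4.0, -6.0, log_p_src_lm=-5.0)  # -5.8 - 0.3*(-5.0) = -4.3
```

Subtracting the source-domain or internal LM score raises the fused score of hypotheses that the E2E model's implicit LM was over-rewarding, which is the bias the Density Ratio and ILME corrections target.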
“…Other solutions include synthesizing speech from the text-only adaptation data using a text-to-speech (TTS) model, and then fine-tuning the E2E model with the synthesized audio-transcript pairs [23,19,24]. During inference, only the adapted E2E model is needed.…”
Section: Introduction
confidence: 99%
“…However, all these methods require audio as the adaptation data when applied to E2E models [35,36,37]. A promising solution is to integrate an external language model (LM) into the E2E model during inference or during MWER training [38,16,39]. With no clear separation of acoustic and language models in an E2E model, LM fusion remains a challenging task.…”
Section: Introduction
confidence: 99%
“…We first apply Shallow Fusion to generate the N-best hypotheses and compute their posteriors for MWER training (i.e., MWER-SF). Note that our MWER-SF differs from [38] in that the E2E and LM scores are interpolated in the log domain and the combined scores are re-normalized over the N-best hypotheses. Further, we propose MWER training with ILME (MWER-ILME), in which the N-best hypotheses are generated by an ILME-based Fusion and their posteriors are computed from the probabilities of the E2E model, internal LM and external LM.…”
Section: Introduction
confidence: 99%
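The re-normalization step described in the statement above (interpolated log scores turned into posteriors over the N-best list, then used to weight word errors) can be sketched as follows. The function names, the example scores, and the error counts are hypothetical; this is a softmax-over-N-best sketch of the MWER objective, not the cited authors' implementation.

```python
import math

def nbest_posteriors(log_scores):
    """Re-normalize combined (E2E + LM, log-domain) scores over the
    N-best list via a numerically stable softmax."""
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_wer(log_scores, word_errors):
    """MWER objective term: expected number of word errors under the
    N-best posterior distribution."""
    post = nbest_posteriors(log_scores)
    return sum(p * e for p, e in zip(post, word_errors))

# Two hypothetical hypotheses with fused log scores and error counts:
loss = expected_wer([-5.8, -6.4], [1, 3])  # posterior-weighted error count
```

Minimizing this expectation pushes posterior mass toward hypotheses with fewer word errors, which is why the posterior computation (and hence the fusion used to produce the scores) matters during MWER training.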