2021
DOI: 10.48550/arxiv.2106.02302
Preprint

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

Abstract: Integrating external language models (LMs) into end-to-end (E2E) models remains a challenging task for domain-adaptive speech recognition. Recently, internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction from Shallow Fusion by subtracting a weighted internal LM score from an interpolation of E2E model and external LM scores during beam search. However, on different test sets, the optimal LM interpolation weights vary over a wide range and have to be tune…
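The fusion rule the abstract describes can be sketched as a per-token scoring function. This is a minimal illustration, not the paper's implementation; the weight names (`ext_weight`, `ilm_weight`) and the example values are assumptions for clarity.

```python
def ilme_fusion_score(e2e_logprob: float,
                      ext_lm_logprob: float,
                      ilm_logprob: float,
                      ext_weight: float = 0.5,
                      ilm_weight: float = 0.3) -> float:
    """Combined per-token log score for ILME-based LM fusion.

    Shallow Fusion uses only the first two terms (E2E score plus a
    weighted external LM score); ILME-based fusion additionally
    subtracts a weighted estimate of the E2E model's internal LM score.
    Weight names and defaults here are illustrative, not the paper's.
    """
    return (e2e_logprob
            + ext_weight * ext_lm_logprob
            - ilm_weight * ilm_logprob)

# With ilm_weight = 0 this reduces to plain Shallow Fusion.
score = ilme_fusion_score(-1.0, -2.0, -1.5)
```

During beam search, each hypothesis extension would be ranked by this combined score instead of the raw E2E log-probability alone.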

Cited by 2 publications (2 citation statements)
References 54 publications (69 reference statements)
“…Text data for training the language model is often scraped from typed search logs to achieve the best domain match to smart assistant voice queries. These logs can be very large [2,4], making it prohibitively expensive to make even a single epoch through the data, limiting rare-word exposure. Furthermore, search queries can be heavy-headed, meaning that they contain disproportionately many high-frequency queries relative to low-frequency queries, also limiting rare-word learning.…”
Section: Introduction
confidence: 99%
“…All comparisons are made in a decode-time manner, so works like [13,16,19,20] that modify the architecture or training objective of RNN-T are not included in this work. It should also be clarified that, when discussing "cross-domain", we assume the source domain and target domain are matched in acoustics, otherwise Eq.…”
Section: Introduction
confidence: 99%