2021
DOI: 10.48550/arxiv.2104.04487
Preprint

Language model fusion for streaming end to end speech recognition

Rodrigo Cabrera,
Xiaofeng Liu,
Mohammadreza Ghodsi
et al.

Abstract: Streaming processing of speech audio is required for many contemporary practical speech recognition tasks. Even with the large corpora of manually transcribed speech data available today, it is impossible for such corpora to cover adequately the long tail of linguistic content that's important for tasks such as open-ended dictation and voice search. We seek to address both the streaming and the tail recognition challenges by using a language model (LM) trained on unpaired text data to enhance the end-to-end (E…

Cited by 3 publications (4 citation statements)
References 17 publications

“…The contextual fusion layer integrates information from the entity models into the pre-trained LM by linearly interpolating their final layer output representations based on the utterance context 4. The fusion layer represents the entity models and the pre-trained LM using a class embedding matrix W_c ∈ R^((N+1)×d), where N is the number of entity models.…”
Section: Contextual Fusion Layer (mentioning)
confidence: 99%
“…Our language models can be directly used for shallow fusion [38,4] and n-best hypothesis rescoring [14,36] in seq2seq based speech recognition systems. Our approach can also be extended to integrate entity models directly into the decoders of these systems.…”
Section: Future Work (mentioning)
confidence: 99%
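
For readers unfamiliar with the term, shallow fusion in its usual form adds a weighted external LM log-probability to the recognizer's own log-probability when ranking hypotheses. The sketch below is illustrative only and is not taken from the cited papers: the function names, the `lm_weight` value, and the example scores are all assumptions, and a real decoder would typically apply the same combination at each beam-search step rather than once per finished hypothesis.

```python
import math


def shallow_fusion_score(asr_logprob, lm_logprob, lm_weight=0.3):
    """Combine recognizer and LM log-probabilities (hypothetical helper).

    score = log p_asr(y | x) + lm_weight * log p_lm(y)
    """
    return asr_logprob + lm_weight * lm_logprob


def pick_best(candidates, lm_weight=0.3):
    """Return the text of the highest-scoring candidate.

    `candidates` is a list of (text, asr_logprob, lm_logprob) tuples;
    the scores here are made-up values for illustration.
    """
    best = max(
        candidates,
        key=lambda c: shallow_fusion_score(c[1], c[2], lm_weight),
    )
    return best[0]


# Two acoustically similar hypotheses; the LM strongly prefers the first.
candidates = [
    ("recognize speech", math.log(0.4), math.log(0.5)),
    ("wreck a nice beach", math.log(0.5), math.log(0.01)),
]

best_with_lm = pick_best(candidates, lm_weight=0.3)
best_without_lm = pick_best(candidates, lm_weight=0.0)
```

With `lm_weight=0.0` the ranking falls back to the recognizer's score alone, which is a convenient way to check how much the LM changes the outcome.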
“…End-to-end ASR models [38], as opposed to traditional Gaussian mixture models, have been increasingly gaining popularity since end-to-end models consist of less components, hence reducing maintenance costs. However, integration of external LMs into [5,21,29], and personalization of [11,33,34], end-to-end systems remains an active research area. With respect to LM, Neural Network LMs (NNLM) [1] have gained popularity within ASR [12,30,43].…”
Section: Beyond Ir (mentioning)
confidence: 99%