Interspeech 2020
DOI: 10.21437/interspeech.2020-1787

Class LM and Word Mapping for Contextual Biasing in End-to-End ASR

Abstract: In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid interest in the speech recognition community. They convert speech input to text units with a single trainable neural network model. In ASR, many utterances contain rich named entities. Such named entities may be user- or location-specific, and they are not seen during training. A single static model is inflexible at utilizing dynamic contextual information during inference. In this paper, we propose to train a context-aware E2E model and allow the…

Cited by 31 publications (32 citation statements)
References 18 publications
“…A challenge for the fusion-based biasing method is that it usually benefits from prefix which may not be available all the time. In [196], class tags are inserted into word transcription during training to enable context aware training. During inference, class tags are used to construct contextual bias finite-state transducer.…”
Section: C) Customization
confidence: 99%
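To make the tagging-and-biasing idea in the statement above concrete, here is a minimal sketch assuming a hypothetical contact list and tag names: it wraps known entity spans in class tags for training transcripts, and compiles the entity list into a simple prefix trie that stands in for the contextual bias finite-state transducer used at inference. This is illustrative only, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): insert class tags into training
# transcripts and build a prefix trie standing in for a contextual bias FST.
# The contact list, tag names, and bias cost below are hypothetical.

CONTACT_LIST = {"alice", "bob marley"}   # hypothetical user-specific entities

def tag_transcript(words, entities=CONTACT_LIST, tag="<contact>"):
    """Wrap known entity spans with class tags, e.g.
    'call bob marley now' -> 'call <contact> bob marley </contact> now'."""
    out, i = [], 0
    while i < len(words):
        matched = False
        for j in range(len(words), i, -1):          # try longer spans first
            if " ".join(words[i:j]) in entities:
                out += [tag] + words[i:j] + [tag.replace("<", "</", 1)]
                i, matched = j, True
                break
        if not matched:
            out.append(words[i])
            i += 1
    return out

def build_bias_trie(entities, bonus_cost=-0.5):
    """Compile the entity list into a word-level prefix trie; a negative cost
    on the final node acts as a bonus toward completing that phrase."""
    trie = {}
    for phrase in entities:
        node = trie
        for w in phrase.split():
            node = node.setdefault(w, {})
        node["<score>"] = bonus_cost
    return trie

if __name__ == "__main__":
    print(tag_transcript("call bob marley now".split()))
    print(build_bias_trie(CONTACT_LIST))
```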
“…An on-the-fly rescoring mechanism was proposed to adjust the LM weights of n-grams which is relevant to the dynamic context during the decoding procedure in [20]. In [21], the class LM and word mapping algorithm were proposed to achieve the rare entity words recognition with the LAS (Listen, Attend, and Spell) [22] architecture. A shallow-fusion end-to-end biasing method [23] showed the competitive performance with the recurrent neural network transducer (RNN-T) [24] model.…”
Section: Contextual ASR Systems
confidence: 99%
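As a rough illustration of the shallow-fusion and on-the-fly rescoring ideas mentioned in the statement above, the sketch below combines an E2E hypothesis score with a contextual bias score log-linearly during rescoring. The hypotheses, bias table, and interpolation weight are placeholder assumptions, not values from the cited systems.

```python
# Minimal sketch of shallow-fusion style contextual biasing of beam hypotheses,
# under the usual log-linear combination assumption. All inputs are dummies.

def fused_score(e2e_logprob, context_score, lam=0.3):
    """Combine the E2E model log-probability with a contextual bias score."""
    return e2e_logprob + lam * context_score

def rescore_beam(hypotheses, context_bias, lam=0.3):
    """hypotheses: list of (tokens, e2e_logprob).
    context_bias: dict mapping a token tuple to a bias score (0.0 if absent)."""
    rescored = []
    for tokens, e2e_lp in hypotheses:
        ctx = context_bias.get(tuple(tokens), 0.0)
        rescored.append((tokens, fused_score(e2e_lp, ctx, lam)))
    return sorted(rescored, key=lambda h: h[1], reverse=True)

if __name__ == "__main__":
    beam = [(["call", "bob", "marley"], -4.2),
            (["call", "bob", "barley"], -3.9)]
    bias = {("call", "bob", "marley"): 2.0}   # hypothetical contextual boost
    # The biased hypothesis overtakes the acoustically preferred one.
    print(rescore_beam(beam, bias))
```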
“…Since the contextual information is commonly considered as the contextual segments (n-grams, queries and entities) [8,12,14,20,21] in contextual speech recognition, the c-encoder is mainly used to extract the embeddings of these segments. Given a list of N contextual segments Z = {z1, z2, ..., zN }, the c-encoder encodes each segment to a vector.…”
Section: Context Processing Network
confidence: 99%
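The sketch below shows one way such a c-encoder could map a list of N contextual segments to fixed-size vectors, assuming a BiLSTM over token embeddings; the dimensions, token ids, and pooling choice are illustrative assumptions rather than the cited systems' exact configuration.

```python
# Illustrative context encoder ("c-encoder") sketch: each contextual segment
# (n-gram, query, or entity) is embedded and reduced to one vector.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, segments):
        """segments: LongTensor of shape (N, T) holding N padded contextual
        segments of length T; returns (N, 2 * hidden_dim) segment vectors."""
        emb = self.embed(segments)                  # (N, T, emb_dim)
        _, (h_n, _) = self.lstm(emb)                # h_n: (2, N, hidden_dim)
        # concatenate final forward and backward states for each segment
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (N, 2 * hidden_dim)

if __name__ == "__main__":
    enc = ContextEncoder()
    fake_segments = torch.randint(1, 1000, (5, 4))  # 5 segments, 4 tokens each
    print(enc(fake_segments).shape)                 # torch.Size([5, 256])
```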