ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683573

End-to-end Contextual Speech Recognition Using Class Language Models and a Token Passing Decoder

Abstract: End-to-end modeling (E2E) of automatic speech recognition (ASR) blends all the components of a traditional speech recognition system into a unified model. Although it simplifies training and decoding pipelines, the unified model is hard to adapt when mismatch exists between training and test data. In this work, we focus on contextual speech recognition, which is particularly challenging for E2E models because it introduces significant mismatch between training and test data. To improve the performance in the p…

Cited by 39 publications (41 citation statements)
References 28 publications
“…This biasing can be applied at word boundaries [8], at the grapheme level [11,13,10], or at the subword level [13,14]. Given that E2E models generally use a constrained beam [16], applying biasing only at word boundaries cannot improve performance if the relevant word does not already appear in the beam.…”
Section: Previous Work
confidence: 99%
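The contrast in the excerpt above can be sketched in a few lines: if biasing phrases are tracked at the subword level via a prefix set, a hypothesis earns a partial bonus as soon as its first matching subword is emitted, so it can survive a narrow beam rather than waiting for the word boundary. The phrase, its subword split, and all function names below are illustrative assumptions, not taken from the paper.

```python
# Hypothetical biasing phrase with an assumed subword split ("_" marks
# a word-initial piece, SentencePiece-style).
BIAS_PHRASES = {"play back": ["_play", "_back"]}

def build_prefixes(phrases):
    # Collect every prefix of every phrase's subword sequence, so partial
    # matches can be rewarded mid-word.
    prefixes = set()
    for pieces in phrases.values():
        for i in range(1, len(pieces) + 1):
            prefixes.add(tuple(pieces[:i]))
    return prefixes

def subword_bonus(hyp_pieces, prefixes, bonus=2.0):
    # Award a bonus proportional to the longest phrase prefix that the
    # hypothesis currently ends with; 0 if no prefix matches.
    for n in range(len(hyp_pieces), 0, -1):
        if tuple(hyp_pieces[-n:]) in prefixes:
            return bonus * n
    return 0.0

prefixes = build_prefixes(BIAS_PHRASES)
partial = subword_bonus(["_play"], prefixes)          # rewarded mid-phrase
full = subword_bonus(["_play", "_back"], prefixes)    # rewarded again at completion
miss = subword_bonus(["_stop"], prefixes)             # no bias applied
```

With word-boundary-only biasing, `partial` would be 0, and a constrained beam could prune the hypothesis before "back" is ever produced.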
“…Model The objective of this task is to generate reasonable sentences from fed-in audio utterances. The baseline model proposed for this task adopts an encoder-decoder approach, similar to that of end-to-end automatic speech recognition [19,20] and image captioning [6,7] tasks. The encoder outputs a single fixed-dimensional vector u for each utterance.…”
Section: Feature Name, Window, Shift, Dimension
confidence: 99%
“…Shallow fusion [10,9] solves this by generating on-the-fly contextual LMs that are interpolated with the E2E neural model's scores to bias the beam search during decoding. This method was further improved by using a token-passing decoder with efficient token recombination to minimize search errors when the number of contextual entities is large [15]. While this showed improvements over standard shallow fusion, there are still gaps between E2E and traditional modular systems in contextual ASR.…”
Section: Contextual Speech Recognition Using Shallow Fusion
confidence: 99%
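As a rough illustration of the interpolation described above, shallow fusion adds a weighted contextual-LM log-probability to the E2E model's log-probability when scoring beam candidates. The weight and the probabilities below are invented for the example, not taken from the paper.

```python
import math

def shallow_fusion_score(log_p_e2e, log_p_ctx, lam=0.5):
    # Interpolate the E2E model score with the contextual LM score.
    # lam is a hypothetical biasing weight, tuned on held-out data.
    return log_p_e2e + lam * log_p_ctx

# Rescore two hypothetical beam candidates: the contextual LM boosts the
# in-context entity "caitlin" over the acoustically similar "kate lin".
hyps = {
    "caitlin": shallow_fusion_score(math.log(0.30), math.log(0.80)),
    "kate lin": shallow_fusion_score(math.log(0.35), math.log(0.05)),
}
best = max(hyps, key=hyps.get)
```

Without the contextual term (lam=0), "kate lin" would win on acoustic score alone; the biasing term flips the ranking toward the entity that appears in the user's context.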
“…The main network follows the LAS architecture [12], with a 2-layer BLSTM with 1400 hidden nodes per layer as the encoder and a 2-layer LSTM with 700 hidden nodes as the decoder. More details can be found in [15]. We built the system with PyTorch [25] based on ESPnet [26], and implemented block-momentum SGD [27] to enable distributed training with linear speedups and no performance degradation.…”
Section: Setup
confidence: 99%
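As a back-of-the-envelope check on the model size implied by this setup, the textbook LSTM parameter count 4·(h·(x+h)+h) applied to the 2-layer, 1400-unit BLSTM encoder gives roughly 64M encoder parameters. The 80-dim input feature size is an assumption for the example (the excerpt does not state it), and per-gate bias conventions vary slightly between toolkits.

```python
def lstm_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights, and one bias vector.
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def blstm_encoder_params(input_size, hidden_size, num_layers):
    total = 0
    for layer in range(num_layers):
        # Layers above the first consume the concatenated forward/backward
        # outputs of the previous BLSTM layer.
        in_size = input_size if layer == 0 else 2 * hidden_size
        total += 2 * lstm_params(in_size, hidden_size)  # two directions
    return total

# Assuming (hypothetically) 80-dim log-mel input features:
enc = blstm_encoder_params(80, 1400, 2)  # ~64M parameters
```

A model of this size makes the distributed block-momentum SGD training mentioned in the excerpt a natural choice.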