ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053600
Hybrid Autoregressive Transducer (HAT)

Abstract: This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model, which can be used to decide whether inference with an external language model is beneficial or not. This article also presents a finite context version of the HAT model that addresses the exposure bias problem and significant…
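To make the abstract's point about external-LM inference concrete: in the HAT framework, decoding with an external LM is typically written with the internal LM score subtracted out. A hedged sketch of this decoding rule, where the interpolation weights λ are tuning parameters assumed here rather than values from the paper:

\hat{y} = \arg\max_{y} \big[ \log P_{\mathrm{HAT}}(y \mid x) - \lambda_{\mathrm{ILM}} \log P_{\mathrm{ILM}}(y) + \lambda_{\mathrm{LM}} \log P_{\mathrm{LM}}(y) \big]

Here P_ILM is the internal LM measured from the model itself and P_LM is the external LM.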

Cited by 119 publications (97 citation statements)
References 45 publications
“…Density ratio (DR) LM fusion [13] is a shallow fusion technique that combines two language models: an external LM trained on a target domain corpus and a language model trained on the acoustic transcripts (source domain) only. The latter is used to subtract the effect of the intrinsic LM given by the prediction network (idea further developed in [14]). Decoding using DR fusion is done according to:…”
Section: Training and Decoding Recipe
Citation type: mentioning (confidence: 99%)
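The decoding rule elided from this snippet is, as commonly written in the density-ratio literature (the λ weights and exact notation are assumptions here and may differ from the form in [13]):

y^{*} = \arg\max_{y} \big[ \log P(y \mid x) + \lambda_{\mathrm{ext}} \log P_{\mathrm{ext}}(y) - \lambda_{\mathrm{src}} \log P_{\mathrm{src}}(y) \big]

where P_ext is the target-domain LM and P_src is the LM trained on the acoustic transcripts.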
“…This led to a rapidly evolving research landscape in end-to-end modeling for ASR, with Recurrent Neural Network Transducers (RNN-T) [1] and attention-based models [2,3] being the most prominent examples. Attention-based models are excellent at handling non-monotonic alignment problems such as translation [4], whereas RNN-Ts are an ideal match for the left-to-right nature of speech [5][6][7][8][9][10][11][12][13][14][15][16][17].…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…While it is well grounded in Bayes' rule, the density ratio method requires the training of two separate LMs, on the training and target data respectively. Variani et al [283] proposed a hybrid autoregressive transducer (HAT) model to improve the RNN-T model. The HAT model builds a training-set LM internally, and the label distribution is derived by normalizing the score functions across all labels excluding blank.…”
Section: Language Model Adaptation
Citation type: mentioning (confidence: 99%)
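The factorization described in this statement is easy to sketch. Below is a minimal Python illustration of the idea, assuming per-step logits from a joint network; the function names and shapes are hypothetical, not from the paper's implementation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def hat_local_posterior(blank_logit, label_logits):
    # HAT separates the blank decision from label prediction:
    # a Bernoulli over "blank vs. emit a label", then a distribution
    # normalized across the real labels only (blank excluded).
    p_blank = sigmoid(blank_logit)
    p_labels = (1.0 - p_blank) * softmax(label_logits)
    return p_blank, p_labels  # p_blank + p_labels.sum() == 1

def internal_lm_log_probs(decoder_only_logits):
    # The internal LM is scored from the label distribution with the
    # encoder (acoustic) contribution removed, renormalized over labels.
    return np.log(softmax(decoder_only_logits))

Separating blank from the label distribution is what lets the normalized label scores be read as an internal LM in the first place.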
“…Zhang et al [6] investigate the impact of varying label context in the transformer-transducer model (an RNN-T that replaces LSTMs with transformer networks [18]), finding that a context of 3-4 previous graphemes achieves similar performance to a full-context baseline on the Librispeech dataset. Finally, Variani et al [19] find that the hybrid autoregressive transducer (HAT; an RNN-T with an 'internal' language model (LM)), trained to output phonemes and decoded with a separate lexicon and grammar, achieves similar performance when context is limited to two previous phoneme labels on a large-scale task. Our work differs from the previously mentioned works in two ways.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
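The limited-context results quoted here amount to truncating the label history seen by the prediction network. Schematically, with a context of k previous labels (k ≈ 2-4 in the studies above):

P(y_u \mid y_1, \ldots, y_{u-1}, x) \;\approx\; P(y_u \mid y_{u-k}, \ldots, y_{u-1}, x)

which also mitigates exposure bias, since the model never conditions on a long self-generated history.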