“…Representative models include streaming models such as the recurrent neural network transducer (RNN-T) [1], attention-based models [8,2,3], and transformer-based models [9,10,11,12]. Compared to sophisticated conventional models [13,14], E2E models such as RNN-T and Listen, Attend and Spell (LAS) have shown competitive performance [6,5,7,15]. To further improve recognition accuracy, a two-pass LAS rescoring model has been proposed in [16], which uses a nonstreaming LAS decoder to rescore the RNN-T hypotheses.…”
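The two-pass idea in the last sentence, where a second model rescores hypotheses from a first-pass model, can be illustrated with a minimal n-best rescoring sketch. This is not the method of [16]: the excerpt does not say how the two scores are combined, so the linear interpolation of log-probabilities, the `weight` parameter, and the names `rescore` and `toy_second_pass` are all assumptions made for illustration. The toy lookup table stands in for a real non-streaming decoder.

```python
from typing import Callable, List, Tuple

# (transcript, first-pass log-probability), e.g. from a streaming first-pass model
Hypothesis = Tuple[str, float]


def rescore(
    hypotheses: List[Hypothesis],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank first-pass hypotheses using a second-pass score.

    The combined score linearly interpolates the two log-probabilities.
    This combination rule is an assumption of the sketch, not taken from [16].
    """
    rescored = [
        (text, (1.0 - weight) * first + weight * second_pass_score(text))
        for text, first in hypotheses
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


def toy_second_pass(text: str) -> float:
    # Hypothetical stand-in for a full-context second-pass decoder:
    # a fixed table of log-probabilities, with a low default for unknown text.
    table = {"recognize speech": -1.0, "wreck a nice beach": -6.0}
    return table.get(text, -10.0)


# Made-up n-best list in which the first pass slightly prefers the wrong transcript.
n_best = [("wreck a nice beach", -2.0), ("recognize speech", -2.5)]
ranked = rescore(n_best, toy_second_pass, weight=0.5)
best_text = ranked[0][0]
```

With `weight=0.0` the ranking reduces to the first-pass order, and with `weight=1.0` it depends only on the second pass, so the parameter controls how much the rescoring model is trusted in this sketch.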