“…End-to-end models have become a popular choice for speech recognition, thanks to both the simplicity of building them and their superior performance over conventional systems [3,4,5,6,7,8,9,10,11,12,1,2]. In contrast to conventional systems, which consist of separate acoustic, pronunciation, and language modeling components, end-to-end approaches formulate the speech recognition problem directly as a mapping from utterances to transcripts, which greatly simplifies both training and decoding.…”