“…For the decoder network, we used a 2-layer LSTM with 300 cells. In addition to the standard decoder network, our proposed models require extra parameters for the gating layers that fuse the conversational-context embedding into the decoder network, compared to the baseline.

Prior Models                        | Output units            | # Params | SWBD WER (%) | CH WER (%)
LF-MMI (Povey et al., 2016)         | CD phones               | N/A      | 9.6          | 19.3
CTC (Zweig et al., 2017)            | Char                    | 53M      | 19.8         | 32.1
CTC (Sanabria and Metze, 2018)      | Char, BPE-{300,1k,10k}  | 26M      | 12.5         | 23.7
CTC (Audhkhasi et al., 2018)        | Word (Phone init.)      | N/A      | 14.6         | 23.6
Seq2Seq (Zeyer et al., 2018)        | BPE-10k                 | 150M*    | 13.5         | 27.1
Seq2Seq (Palaskar and Metze, 2018)  | Word-10k                | N/A      | 23.0         | 37.2
Seq2Seq (Zeyer et al., 2018)        | BPE-1k                  | 150M*    | 11.8         | 25.7
…”
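As a rough illustration only (not the authors' implementation), the PyTorch sketch below shows one common way a sigmoid gating layer can fuse a context embedding into a decoder hidden state: a gate computed from both inputs scales the context vector before it is combined with the decoder state. The class name `GatedContextFusion`, the layer sizes, and the fusion order are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    """Hypothetical gating layer: fuses a conversational-context
    embedding into a decoder hidden state (a sketch, not the paper's code)."""

    def __init__(self, hidden_size=300, context_size=300):
        super().__init__()
        # Gate is conditioned on both the decoder state and the context.
        self.gate = nn.Linear(hidden_size + context_size, context_size)
        # Project the concatenated (state, gated context) back to hidden_size.
        self.proj = nn.Linear(hidden_size + context_size, hidden_size)

    def forward(self, dec_state, context_emb):
        # dec_state: (batch, hidden_size); context_emb: (batch, context_size)
        joint = torch.cat([dec_state, context_emb], dim=-1)
        g = torch.sigmoid(self.gate(joint))          # elementwise gate in [0, 1]
        fused = torch.cat([dec_state, g * context_emb], dim=-1)
        return torch.tanh(self.proj(fused))          # (batch, hidden_size)

# Usage: fuse a context vector into an LSTM decoder state of 300 cells.
fusion = GatedContextFusion(hidden_size=300, context_size=300)
h = torch.randn(8, 300)   # decoder hidden state (batch of 8)
c = torch.randn(8, 300)   # conversational-context embedding
out = fusion(h, c)        # (8, 300), passed onward in the decoder
```

Note that the gate and projection layers above are exactly the kind of extra parameters the proposed models add relative to the baseline decoder.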