“…For text-only data, the text is mainly used to train an external language model (LM) for joint decoding [11,12,13,14,15]. To exploit both unpaired speech and text, many methods have recently been proposed, e.g., integration of a pre-trained acoustic model and LM [16,17,18,19], cycle-consistency-based dual training [20,21,22,23], and shared representation learning [24,25,26,27]. These methods rely on hybrid models with multitask training, and some of them become less effective when the amount of labeled data is very limited. The current mainstream methods that achieve state-of-the-art (SOTA) results in low-resource ASR use the unpaired speech for pre-training and the unpaired text for training an LM used in joint decoding [7,8], and additionally adopt iterative self-training [28].…”
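As a minimal illustration of LM joint decoding, one common variant is shallow fusion, where each hypothesis is scored by the ASR model's log-probability plus a weighted external-LM log-probability. The function name, the weight value, and the toy scores below are illustrative assumptions, not details from the cited works.

```python
# Shallow-fusion joint decoding (illustrative sketch, not the papers' exact method):
# fused score = log P_asr(y | x) + lam * log P_lm(y)

def shallow_fusion_score(asr_logp, lm_logp, lam=0.3):
    """Combine ASR and external-LM log-probabilities with weight lam (hypothetical value)."""
    return asr_logp + lam * lm_logp

# Toy example: rank two candidate transcripts by fused score.
# (asr log-prob, lm log-prob) pairs are made up for illustration.
candidates = {
    "the cat sat": (-2.0, -1.0),  # slightly worse acoustically, much better under the LM
    "the cat sad": (-1.8, -4.0),  # slightly better acoustically, implausible under the LM
}
best = max(candidates, key=lambda h: shallow_fusion_score(*candidates[h]))
# The LM term flips the ranking in favor of the linguistically plausible hypothesis.
```

In practice this rescoring is applied to partial hypotheses inside beam search rather than to complete transcripts, but the scoring rule is the same.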