ASR systems designed for native English (L1) speech usually underperform on non-native English (L2) speech. To address this performance gap, (i) we extend our previous work and investigate fine-tuning of a pre-trained wav2vec 2.0 model [2, 56] under a rich set of L1 and L2 training conditions, and (ii) we incorporate language model decoding into the ASR system alongside the fine-tuning method. Quantifying the gains from each of these two approaches separately, together with an error analysis, allows us to identify the different sources of improvement within our models. We find that while the large self-trained wav2vec 2.0 may internalize sufficient decoding knowledge for clean L1 speech [56], this does not hold for L2 speech, which explains the utility of language model decoding on L2 data.
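To make the two ingredients concrete, here is a minimal sketch (not the authors' exact pipeline) of combining a pre-trained wav2vec 2.0 acoustic model with n-gram language model decoding of its CTC output; the checkpoint name and the `lm.arpa` path are placeholder assumptions.

```python
# Sketch: wav2vec 2.0 CTC logits decoded with a KenLM n-gram LM via pyctcdecode.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from pyctcdecode import build_ctcdecoder

# Placeholder checkpoint; the paper's fine-tuned models would be loaded here.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

# CTC vocabulary in label order, as required by the decoder.
vocab = [tok for tok, _ in sorted(processor.tokenizer.get_vocab().items(),
                                  key=lambda kv: kv[1])]

# Attach an n-gram LM; "lm.arpa" is a hypothetical path to a KenLM model.
decoder = build_ctcdecoder(vocab, kenlm_model_path="lm.arpa")

def transcribe(waveform, sample_rate=16_000):
    """Acoustic forward pass, then beam-search decoding with the LM."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits[0]
    return decoder.decode(logits.numpy())
```

Comparing this LM-decoded output against plain greedy CTC decoding of the same logits is one way to quantify the gain attributable to language model decoding alone.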