Building a good speech recognition system usually requires a large amount of paired speech–text data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved strong performance in low-resource speech recognition, but it has rarely been applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec 2.0 is improved by integrating a factorized TDNN (TDNN-F) layer to better preserve the temporal relationships between speech frames before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy is used to learn latent speech representations from a large amount of unlabeled audio, optimized with a noise-contrastive binary classification task, and is applied to cross-lingual ASR. In addition, speech synthesis is used to improve speech recognition performance. Experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages and that multilingual pre-training clearly outperforms monolingual pre-training. Data augmentation with synthesized speech brings substantial gains. Compared with the baseline model, the word error rate on the LibriSpeech test-clean set is reduced by an average of 1.9%. On the Kazakh KSC test set, pre-training on Kazakh alone reduces the word error rate by 3.8%. Pre-training on multilingual data combined with a small amount of TTS-synthesized Kazakh speech achieves a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, comparable to previous end-to-end models trained with 30 times more labeled data.
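The abstract does not detail the factorized TDNN layer, so the following is only a minimal NumPy sketch of the general TDNN-F idea: a full-rank temporal convolution is replaced by a low-rank bottleneck projection over a context window followed by an expansion. The function name `tdnnf_layer`, the dimensions, and the ReLU nonlinearity are illustrative assumptions, and Kaldi-style TDNN-F additionally constrains the bottleneck projection to be semi-orthogonal, which this sketch omits.

```python
import numpy as np

def tdnnf_layer(x, w_bottleneck, w_expand, context=3):
    """Factorized-TDNN-style layer sketch.

    x: (time, dim) frame sequence.
    w_bottleneck: (context * dim, bottleneck_dim) low-rank projection.
    w_expand: (bottleneck_dim, out_dim) expansion back to model width.
    """
    T, D = x.shape
    pad = context // 2
    # Pad in time so every frame has a symmetric context window.
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Stack each frame's context window into one flat vector: (T, context * D).
    ctx = np.stack([xp[t:t + context].reshape(-1) for t in range(T)])
    h = ctx @ w_bottleneck              # low-rank bottleneck: (T, bottleneck_dim)
    return np.maximum(h @ w_expand, 0)  # expansion + ReLU: (T, out_dim)
```

The factorization keeps the temporal receptive field of a full convolution while using far fewer parameters, which is one plausible reason such a layer can help preserve frame-to-frame structure ahead of quantization.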
In recent years, end-to-end speech recognition models have emerged as a popular alternative to the traditional Deep Neural Network–Hidden Markov Model (DNN-HMM) approach. These models map acoustic features directly onto text sequences through a single network architecture, significantly streamlining model construction. However, training end-to-end speech recognition models typically requires a large quantity of supervised data to achieve good performance, which is a challenge under low-resource conditions. Unsupervised representations significantly reduce this requirement. Recent research has focused on end-to-end techniques employing joint Connectionist Temporal Classification (CTC) and attention mechanisms, with some work also concentrating on unsupervised representation learning. This paper proposes a joint supervised and unsupervised multi-task learning model (JSUM). Our approach uses the unsupervised pre-trained wav2vec 2.0 model as a shared encoder and integrates a joint CTC-attention network and a generative adversarial network into a unified end-to-end architecture. The method provides a new speech recognition solution for low-resource languages that makes optimal use of both supervised and unsupervised datasets by combining CTC, attention, and generative adversarial losses. Furthermore, the proposed approach is suitable for both monolingual and cross-lingual scenarios.
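The abstract combines three losses (CTC, attention, and adversarial) into one training objective. A minimal sketch of one plausible combination is shown below, assuming the standard linear interpolation used in joint CTC-attention training plus a scaled adversarial term; the weights `lam` and `gamma` and the function name `jsum_loss` are hypothetical, as the paper's actual coefficients are not given in the abstract.

```python
def jsum_loss(l_ctc: float, l_att: float, l_adv: float,
              lam: float = 0.3, gamma: float = 0.1) -> float:
    """Hypothetical multi-task objective for a JSUM-style model.

    lam interpolates between the CTC and attention losses, as in
    standard joint CTC-attention training; gamma scales the
    adversarial loss contributed by the GAN branch.
    """
    return lam * l_ctc + (1.0 - lam) * l_att + gamma * l_adv
```

In practice each term would be a per-batch loss tensor from the shared wav2vec 2.0 encoder's task heads, and all three would be backpropagated through the encoder jointly.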