Keyword recognition is a foundation of speech recognition, and its applications are growing rapidly in keyword spotting, robotics, and smart home surveillance. Because of these advanced applications, improving the accuracy of keyword recognition is crucial. In this paper, we propose voice conversion (VC) based augmentation to enlarge a limited training dataset, together with a fused convolutional neural network (CNN) and long short-term memory (LSTM) model for robust speaker-independent isolated keyword recognition. Collecting and preparing a sufficient amount of voice data for speaker-independent speech recognition is a tedious and costly task. To overcome this, we generated new raw voices from the original voices using an auxiliary classifier conditional variational autoencoder (ACVAE). Here, the main aim of voice conversion is to obtain numerous and varied human-like keyword utterances whose pronunciation is identical to neither the source nor the target speakers'. Parallel VC was used to preserve the linguistic content accurately. We examined the performance of the proposed voice conversion augmentation technique using robust deep neural network algorithms. The original training data, processed with other data augmentation and regularization techniques but excluding the generated voices, served as the baseline. The results show that incorporating voice conversion augmentation into the baseline augmentation techniques and applying the CNN-LSTM model improves the accuracy of isolated keyword recognition.
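To make the fused architecture concrete, below is a minimal sketch of a CNN-LSTM keyword classifier in PyTorch: a convolutional front end extracts local time-frequency features from a log-mel spectrogram, and an LSTM models the keyword's temporal structure before classification. The layer sizes, the 40-band log-mel input, and the 10-keyword output are illustrative assumptions, not the configuration reported in the paper.

```python
# A minimal CNN-LSTM keyword classifier sketch, assuming log-mel
# spectrogram inputs of shape (batch, 1, n_mels, time). All sizes
# here are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class CNNLSTMKeywordNet(nn.Module):
    def __init__(self, n_mels=40, n_keywords=10, hidden=128):
        super().__init__()
        # CNN front end: local time-frequency feature extraction.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),          # halve both mel and time axes
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # LSTM back end: temporal modeling over the CNN feature frames.
        self.lstm = nn.LSTM(input_size=64 * (n_mels // 4),
                            hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_keywords)

    def forward(self, x):                   # x: (batch, 1, n_mels, time)
        f = self.cnn(x)                     # (batch, 64, n_mels/4, time/4)
        b, c, m, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * m)  # (batch, time, feat)
        _, (h, _) = self.lstm(f)            # h: (1, batch, hidden)
        return self.classifier(h[-1])       # logits over keywords

model = CNNLSTMKeywordNet()
logits = model(torch.randn(8, 1, 40, 100))  # 8 spectrogram clips
print(logits.shape)                          # torch.Size([8, 10])
```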
Accent similarity evaluation and accent identification are complex and challenging tasks for many applications because of the wide variety of native and non-native language backgrounds around the world. The absence of studies on native and non-native English accent similarity evaluation, together with the limitations of individual feature extraction techniques for accent classification, led us to propose a new model, the intra-native accent feature sharing based native accent identification (NAI) framework, built on an English accent archive speech dataset. The NAI network was employed for non-native English accent classification, native English accent classification, and discrimination between native and non-native English accents. Finally, the similarity of non-native English accents to native English accents was evaluated using the pre-trained NAI model. Moreover, the proposed approach also serves as training data augmentation, easing deep learning's demand for very large training datasets. Ordinary individual voice feature extraction with data augmentation and regularization techniques was the baseline for our work. The proposed approach improved the accuracy of the baseline method by 3.7%-7.5% on average across several robust deep learning algorithms. A Quade test comparing the performance yielded a p-value of 0.01, showing that the proposed approach significantly outperformed the baseline. The model ranks non-native English accents by their similarity to native English accents; in order of proximity, they are Mandarin, Italian, German, French, Amharic, and Hindi.
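The Quade test used for the significance comparison can be reproduced from scratch. The sketch below follows Conover's formulation (within-block ranks weighted by ranked block ranges, compared against an F distribution); the accuracy values are placeholders for illustration, not the study's actual results.

```python
# A from-scratch sketch of the Quade test (Conover, 1999) for comparing
# the baseline and the proposed approach across several classifiers.
# The accuracy values below are placeholders, not the paper's results.
import numpy as np
from scipy.stats import rankdata, f as f_dist

def quade_test(scores):
    """scores: (n_blocks, k_treatments) array, e.g. accuracy of each
    method (columns) on each classifier (rows)."""
    n, k = scores.shape
    r = np.apply_along_axis(rankdata, 1, scores)           # within-block ranks
    q = rankdata(scores.max(axis=1) - scores.min(axis=1))  # ranked block ranges
    s = q[:, None] * (r - (k + 1) / 2)                     # weighted centred ranks
    a = (s ** 2).sum()
    b = (s.sum(axis=0) ** 2).sum() / n
    stat = (n - 1) * b / (a - b)
    p = f_dist.sf(stat, k - 1, (n - 1) * (k - 1))
    return stat, p

# Hypothetical accuracies: rows = five classifiers, cols = (baseline, proposed).
acc = np.array([[0.842, 0.881],
                [0.861, 0.903],
                [0.855, 0.928],
                [0.870, 0.911],
                [0.833, 0.894]])
stat, p = quade_test(acc)
print(f"Quade F = {stat:.3f}, p = {p:.4f}")
```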