Analyzing the intricate nature of an audio signal often requires the extraction of relevant features, which serve as informative descriptors of the signal. It entails studying the signal and determining how signals are related to one another. As a result, the performance of audio spoofing detection in Automatic Speaker Verification (ASV) systems is strongly reliant on front-end feature extraction. In this paper, three types of successively integrated features have been proposed. First, Acoustic Ternary Pattern (ATP) image features are sequentially fused with different audio features such as MFCC, CQCC, GTCC, BFCC and PLP, individually. Second, LBP image features are combined with all these audio features similarly. Then, the sequential integration of ATP-LBP features is combined individually with MFCC, CQCC, GTCC, BFCC and PLP features. Finally, these front-end hybrid feature sets are classified using different ML and deep learning algorithms based acoustic models at the back-end. The state-of-the-art ASVspoof 2019 dataset has been used to implement various front-end and back-end combinations. The research outcomes reveal that the proposed approach achieved the best results with ATP-LBP-GTCC at the front end with LSTM-based acoustic model at the back-end.
The ability to distinguish between authentic and fake audio is become increasingly difficult due to the increasing accuracy of text-to-speech models, posing a serious threat to speaker verification systems. Furthermore, audio deepfakes are becoming a more likely source of deception with the development of sophisticated methods for producing synthetic voice. The ASVspoof dataset has recently been used extensively in research on the detection of audio deep fakes, together with a variety of machine and deep learning methods. The proposed work in this paper combines data augmentation techniques with hybrid feature extraction method at front-end. Two variants of audio augmentation method and Synthetic Minority Over Sampling Technique (SMOTE) have been used, which have been combined individually with Mel Frequency Cepstral Coefficients (MFCC), Gammatone Cepstral Coefficients (GTCC) and hybrid these two feature extraction methods for implementing front-end feature extraction. To implement the back-end our proposed work two deep learning models, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), and two Machine Learning (ML) classifier Random Forest (RF) and Support Vector Machine (SVM) have been used. For training, and evaluation ASVspoof 2019 Logical Access (LA) partition, and for testing of the said systems, and ASVspoof 2021 deep fake partition have been used. After analysing the results, it can be observed that combination of MFCC+GTCC with SMOTE at front-end and LSTM at back-end has outperformed all other models with 99% test accuracy, and 1.6 % Equal Error Rate (EER) over deepfake partition. Also, the testing of this best combination has been done on DEepfake CROss-lingual (DECRO) dataset. To access the effectiveness of proposed model under noisy scenarios, we have analysed our best model under noisy condition by adding Babble Noise, Street Noise and Car Noise to test data.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.