On the use of deep recurrent neural networks for detecting audio spoofing attacks

Scardapane, Simone; Stoffl, Lucas; Röhrbein, Florian; Uncini, Aurelio

doi:10.1109/ijcnn.2017.7966294

Cited by 12 publications

(4 citation statements)

References 27 publications

“…Additionally, some of the audio spoof detection methods have been extended by working on the features which are fed into the network (Balamurali et al, 2019). While others have changed the networks used or have improved both networks and features (Scardapane et al, 2017; Alzantot et al, 2019; Chintha et al, 2020; Rahul et al, 2020; Wang et al, 2020b; Luo A. et al, 2021). Given the fact that one of the most important deepfake detection challenges is “generalization,” researchers are highly recommended to work on generalization by changing or improving both of the networks and features as well as defining different loss functions (Chen T. et al, 2020; Zhang Y. et al, 2021).…”

Section: Discussion and Future Directions (mentioning)

confidence: 99%

“…In addition, some of the audio spoof detection systems have been extended by working on the features which are fed into the network (Balamurali et al, 2019). While others have worked on the networks used or both of the networks and features (Scardapane et al, 2017; Alzantot et al, 2019; Chintha et al, 2020; Rahul et al, 2020; Wang et al, 2020b; Luo A. et al, 2021). Therefore, besides the modeling phase, the features which are fed to the models are really challenging in the field of audio deepfake.…”

Section: Deepfake Categories (mentioning)

confidence: 99%

Audio deepfakes: A survey

Khanjani, Watson, Janeja. 2023. Front. Big Data

A deepfake is content or material that is synthetically generated or manipulated using artificial intelligence (AI) methods, to be passed off as real and can include audio, video, image, and text synthesis. The key difference between manual editing and deepfakes is that deepfakes are AI generated or AI manipulated and closely resemble authentic artifacts. In some cases, deepfakes can be fabricated using AI-generated content in its entirety. Deepfakes have started to have a major impact on society with more generation mechanisms emerging everyday. This article makes a contribution in understanding the landscape of deepfakes, and their detection and generation methods. We evaluate various categories of deepfakes especially in audio. The purpose of this survey is to provide readers with a deeper understanding of (1) different deepfake categories; (2) how they could be created and detected; (3) more specifically, how audio deepfakes are created and detected in more detail, which is the main focus of this paper. We found that generative adversarial networks (GANs), convolutional neural networks (CNNs), and deep neural networks (DNNs) are common ways of creating and detecting deepfakes. In our evaluation of over 150 methods, we found that the majority of the focus is on video deepfakes, and, in particular, the generation of video deepfakes. We found that for text deepfakes, there are more generation methods but very few robust methods for detection, including fake news detection, which has become a controversial area of research because of the potential heavy overlaps with human generation of fake content. Our study reveals a clear need to research audio deepfakes and particularly detection of audio deepfakes. This survey has been conducted with a different perspective, compared to existing survey papers that mostly focus on just video and image deepfakes. This survey mainly focuses on audio deepfakes that are overlooked in most of the existing surveys. 
This article's most important contribution is to critically analyze and provide a unique source of audio deepfake research, mostly ranging from 2016 to 2021. To the best of our knowledge, this is the first survey focusing on audio deepfakes generation and detection in English.

“…Recurrent Neural Networks (RNN) have significant advantages in dealing with time-dependent problems, utilizing gated structures and recurrent units to maintain the memory of time-series-related issues. Scardapane et al used an RNN network model to extract acoustic features from the ASVspoof2015 dataset and conducted experiments on forged audio, achieving an average EER of 2.91% [30]. Gomez-Alanis et al combined CNN and RNN networks to establish a CNN-RNN hybrid network model for detecting noisy forged audio [31].…”

Section: Related Work (mentioning)

confidence: 99%
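
The equal error rate (EER) quoted in the statement above is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The sketch below is only an illustration of that definition, not the evaluation code of any paper listed here. It assumes that a higher score means "more likely genuine"; the function name `equal_error_rate` and the toy scores are hypothetical.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a higher score means "more likely genuine".
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores, invented for illustration only.
genuine = [0.9, 0.8, 0.7, 0.6, 0.3]
spoof = [0.1, 0.2, 0.4, 0.5, 0.65]
print(f"EER = {equal_error_rate(genuine, spoof):.2%}")
```

With a finite set of trials the two error rates rarely cross exactly, which is why the sketch reports their average at the closest threshold; evaluation toolkits typically interpolate along the ROC curve instead.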

DAMRN: Deep attention mechanism residual network for deepfake audio detection using cepstral coefficient features

Zhou, Yang, Yan et al. 2024. Preprint

In order to further curb the misuse of Deepfake audio technology, we proposed deep attention mechanism residual network (DAMRN) that can effectively detect forged audio. The network exhibits stable operation, low risk of gradient disappearance and gradient explosion, and high detection accuracy. The structure of proposed network mainly involves the following contents: Firstly, a data balancing strategy is adopted in the front end of the network so that the ratio of positive and negative samples in the data maintained proportional balance, which improves the network performance and reduces the overfitting phenomenon. This strategy has been effectively proven by the experiments in this article. Secondly, we compare the accuracy rates of different depths among the network models for Deepfake audio detection (DFAD), and select the network that best suits the depth of this article. Finally, we introduce an effective attention mechanism in the network structure appropriately to increase the network's sensitivity to forged speech artifact information. By obtaining the artifact information of the Deepfake audio, the network model can learn more falsification frequency features that can effectively distinguish between spoofed and bonafide audio, and the accuracy has been improved to 99.81%, with the EER reduced to 0.69%, compared to the baseline system. Experiments are conducted using three acoustic features (MFCC, LFCC, GFCC) extracted from two mainstream datasets (ASVspoof2019LA, ASVspoof2021DF) respectively, and the results show that the best EER value of the method proposed in this paper is 0.32%, which is a better performance compared with other mainstream models.

“…Analysis of the deep RNN network was presented by Scardapane et al [50]. They evaluated four architectures with MFCC features, log-filter bank features, and a concatenation of these two feature sets using ASVSpoof 2015 datasets.…”

Section: Related Work (mentioning)

confidence: 99%

Gaussian-Filtered High-Frequency-Feature Trained Optimized BiLSTM Network for Spoofed-Speech Classification

Mewada, Al-Asad, Almalki et al. 2023. Sensors

Voice-controlled devices are in demand due to their hands-free controls. However, using voice-controlled devices in sensitive scenarios like smartphone applications and financial transactions requires protection against fraudulent attacks referred to as “speech spoofing”. The algorithms used in spoof attacks are practically unknown; hence, further analysis and development of spoof-detection models for improving spoof classification are required. A study of the spoofed-speech spectrum suggests that high-frequency features are able to discriminate genuine speech from spoofed speech well. Typically, linear or triangular filter banks are used to obtain high-frequency features. However, a Gaussian filter can extract more global information than a triangular filter. In addition, MFCC features are preferable among other speech features because of their lower covariance. Therefore, in this study, the use of a Gaussian filter is proposed for the extraction of inverted MFCC (iMFCC) features, providing high-frequency features. Complementary features are integrated with iMFCC to strengthen the features that aid in the discrimination of spoof speech. Deep learning has been proven to be efficient in classification applications, but the selection of its hyper-parameters and architecture is crucial and directly affects performance. Therefore, a Bayesian algorithm is used to optimize the BiLSTM network. Thus, in this study, we build a high-frequency-based optimized BiLSTM network to classify the spoofed-speech signal, and we present an extensive investigation using the ASVSpoof 2017 dataset. The optimized BiLSTM model is successfully trained with the least epoch and achieved a 99.58% validation accuracy. The proposed algorithm achieved a 6.58% EER on the evaluation dataset, with a relative improvement of 78% on a baseline spoof-identification system.
