Real-Time, Universal, and Robust Adversarial Attacks Against Speaker Recognition Systems

Xie, Yi; Shi, Cong; Li, Zhuohang; Liu, Jian; Chen, Yingying; Yuan, Bo

doi:10.1109/icassp40776.2020.9053747

Cited by 80 publications

(57 citation statements)

References 14 publications

“…The VoxCeleb dataset [22] contains speech from 7,363 speakers of multiple ethnicities, accents, occupations and age groups. Among these, for our experiments, we randomly chose 250 speakers (for consistency with the state-of-the-art [10,23,24]): 125 female speakers, and 125 male speakers. All audio files were downsampled to 8 kHz to match the sampling rate of our pre-trained speaker identification model.…”

Section: VoxCeleb Dataset (mentioning)

confidence: 99%

FoolHD: Fooling Speaker Identification by Highly Imperceptible Adversarial Disturbances

Shamsabadi, Teixeira, Abad, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Speaker identification models are vulnerable to carefully designed adversarial perturbations of their input signals that induce misclassification. In this work, we propose a white-box steganography-inspired adversarial attack that generates imperceptible adversarial perturbations against a speaker identification model. Our approach, FoolHD, uses a Gated Convolutional Autoencoder that operates in the DCT domain and is trained with a multi-objective loss function, to generate and conceal the adversarial perturbation within the original audio files. In addition to hindering speaker identification performance, this multi-objective loss accounts for human perception through a frame-wise cosine similarity between MFCC feature vectors extracted from the original and adversarial audio files. We validate the effectiveness of FoolHD with a 250-speaker identification x-vector network, trained using VoxCeleb, in terms of accuracy, success rate, and imperceptibility. Our results show that FoolHD generates highly imperceptible adversarial audio files (average PESQ scores above 4.30), while achieving a success rate of 99.6% and 99.2% in misleading the speaker identification model, for untargeted and targeted settings, respectively.
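The perceptual term mentioned in the abstract above, a frame-wise cosine similarity between MFCC feature vectors, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the frames are assumed to be MFCC vectors already extracted and time-aligned, and averaging `1 - cosine similarity` over frames is one plausible way to turn the similarity into a loss, not necessarily the formulation used in FoolHD.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero frame is treated as having zero similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def framewise_cosine_loss(original_frames, adversarial_frames):
    """Mean of (1 - cosine similarity) over time-aligned feature frames.

    Both arguments are lists of per-frame feature vectors (for example,
    MFCCs computed elsewhere). A value near 0 means the two signals have
    nearly identical frame-level features.
    """
    if not original_frames or len(original_frames) != len(adversarial_frames):
        raise ValueError("need the same non-zero number of frames in both inputs")
    distances = [
        1.0 - cosine_similarity(o, a)
        for o, a in zip(original_frames, adversarial_frames)
    ]
    return sum(distances) / len(distances)
```

In a multi-objective setup, a term like this would typically be weighted and added to the adversarial objective, so that the optimizer trades attack strength against perceptual similarity.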

“…In the first step, we maximize the attack effect on the ASV model which is similar to the method in [13]. We make δ be effective to lead a targeted attack on the ASV model regardless of the content of input x. N audios of the adversary are collected to form a training set X = {x_1, x_2, ..., x_N} where each x_i contains different text contents.…”

Section: Attack on the ASV Model (mentioning)

confidence: 99%
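The idea in the statement above, a single perturbation δ trained over N utterances so that it pushes any input toward a target speaker, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "ASV model" is replaced by a tiny linear softmax scorer, the inputs are short feature vectors rather than audio, and the names `train_universal_perturbation`, `eps`, `lr`, and `steps` are hypothetical. It is not the algorithm of the citing paper or of [13].

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(weights, x):
    """Toy linear scorer: one weight vector per speaker class."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]


def predict(weights, x):
    """Index of the highest-scoring speaker class."""
    scores = logits_for(weights, x)
    return scores.index(max(scores))


def train_universal_perturbation(weights, inputs, target, eps=0.5, lr=0.1, steps=200):
    """Gradient descent on the mean cross-entropy toward `target`.

    A single delta is shared by all inputs and clipped to [-eps, eps]
    after every step, so it stays bounded while working across inputs.
    """
    dim = len(inputs[0])
    delta = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x in inputs:
            x_adv = [a + d for a, d in zip(x, delta)]
            probs = softmax(logits_for(weights, x_adv))
            # d(loss)/d(input) for a linear softmax model:
            # sum over classes of (p_k - 1[k == target]) * w_k
            for k, w in enumerate(weights):
                coeff = probs[k] - (1.0 if k == target else 0.0)
                for j in range(dim):
                    grad[j] += coeff * w[j] / len(inputs)
        delta = [max(-eps, min(eps, d - lr * g)) for d, g in zip(delta, grad)]
    return delta
```

A brief usage example with made-up numbers: three inputs that the toy scorer assigns to class 0 are all moved to class 1 by the same bounded δ.

```python
weights = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[1.0, 0.8], [0.9, 0.7], [1.2, 1.0]]
delta = train_universal_perturbation(weights, inputs, target=1)
shifted = [predict(weights, [a + d for a, d in zip(x, delta)]) for x in inputs]
```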

“…It provides a baseline WER to measure the distortion on speech recognition caused by the adversarial perturbations. As mentioned in Section 3.1 our first step is very similar to the method used in [13]. Hence we used the adversarial perturbation δ_1^* generated in the first step as the baseline method.…”

Section: Evaluation of Digital Attacks (mentioning)

confidence: 99%

“…But their adversarial examples will be rejected in the PSV system for audio replay or different speech content. Studies [12,13] crafted universal adversarial perturbations that were text-independent and could launch attack in real time. But it is not proved that their perturbations could not affect the speech content recognition and remained effective after being played separately over the air to pass the audio replay detection.…”

Section: Introduction (mentioning)

confidence: 99%

Attack on Practical Speaker Verification System Using Universal Adversarial Perturbations

Zhang, Zhao, Liu, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In authentication scenarios, applications of practical speaker verification systems usually require a person to read a dynamic authentication text. Previous studies played an audio adversarial example as a digital signal to perform physical attacks, which would be easily rejected by audio replay detection modules. This work shows that by playing our crafted adversarial perturbation as a separate source when the adversary is speaking, the practical speaker verification system will misjudge the adversary as a target speaker. A two-step algorithm is proposed to optimize the universal adversarial perturbation to be text-independent and has little effect on the authentication text recognition. We also estimated room impulse response (RIR) in the algorithm which allowed the perturbation to be effective after being played over the air. In the physical experiment, we achieved targeted attacks with success rate of 100%, while the word error rate (WER) on speech recognition was only increased by 3.55%. And recorded audios could pass replay detection for the live person speaking.
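The word error rate (WER) cited in the abstract above is conventionally the word-level edit distance between a reference transcript and a recognizer's output, divided by the number of reference words. A minimal sketch of that standard definition follows; the function name and the whitespace tokenization are assumptions, and the paper's own scoring setup may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Counts substitutions, deletions, and insertions with equal cost.
    Tokenization is a plain whitespace split (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

An increase in WER, as reported in the abstract, would then be the difference between this value computed on transcripts of perturbed audio and on transcripts of clean audio.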

“…The feasibility of inaudible audio attacks in speech recognition systems, such as Siri and Alexa, was proven in [25]. Speaker recognition systems based on deep neural networks are also vulnerable to adversarial attacks, as shown by the high attack success rate of over 90% [26].…”

Section: Related Work (mentioning)

confidence: 99%

Non-Invasive Challenge Response Authentication for Voice Transactions with Smart Home Behavior

Hayashi, Ruggiero 2020

Sensors

Smart speakers, such as Alexa and Google Home, support daily activities in smart home environments. Even though voice commands enable friction-less interactions, existing financial transaction authorization mechanisms hinder usability. A non-invasive authorization by leveraging presence and light sensors’ data is proposed in order to replace invasive procedure through smartphone notification. The Coloured Petri Net model was created for synthetic data generation, and one month data were collected in test bed with real users. Random Forest machine learning models were used for smart home behavior information retrieval. The LSTM prediction model was evaluated while using test bed data, and an open dataset from CASAS. The proposed authorization mechanism is based on Physical Unclonable Function usage as a random number generator seed in a Challenge Response protocol. The simulations indicate that the proposed scheme with specialized autonomous device could halve the total response time for low value financial transactions triggered by voice, from 7.3 to 3.5 s in a non-invasive manner, maintaining authorization security.
