ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053747

Real-Time, Universal, and Robust Adversarial Attacks Against Speaker Recognition Systems

Abstract: As the popularity of voice user interfaces (VUIs) has exploded in recent years, speaker recognition systems have emerged as an important means of identifying a speaker in many security-critical applications and services. In this paper, we propose the first real-time, universal, and robust adversarial attack against state-of-the-art deep neural network (DNN) based speaker recognition systems. By adding an audio-agnostic universal perturbation to an arbitrary enrolled speaker's voice input, the DNN-based speaker r…

Cited by 80 publications (57 citation statements)
References 14 publications
“…The VoxCeleb dataset [22] contains speech from 7,363 speakers of multiple ethnicities, accents, occupations and age groups. Among these, for our experiments, we randomly chose 250 speakers (for consistency with the state-of-the-art [10,23,24]): 125 female speakers, and 125 male speakers. All audio files were downsampled to 8 kHz to match the sampling rate of our pre-trained speaker identification model.…”
Section: VoxCeleb Dataset
Citation type: mentioning
confidence: 99%
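
The data preparation described in the statement above (a random, gender-balanced subset of speakers and downsampling to 8 kHz) might be sketched as follows. This is a minimal illustration, not the citing paper's code: the function names, the `(speaker_id, gender)` metadata format, the fixed seed, and the assumption that the source audio is sampled at 16 kHz (so that a factor of 2 reaches 8 kHz) are all hypothetical.

```python
import random

import numpy as np


def pick_balanced_speakers(speakers, per_gender=125, seed=0):
    """Randomly pick `per_gender` female and `per_gender` male speaker IDs.

    `speakers` is a list of (speaker_id, gender) pairs with gender "f" or "m".
    This metadata layout is an assumption made for the sketch.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for gender in ("f", "m"):
        pool = sorted(sid for sid, g in speakers if g == gender)
        chosen.extend(rng.sample(pool, per_gender))
    return chosen


def downsample_by_2(x, num_taps=101):
    """Halve the sampling rate (e.g. an assumed 16 kHz down to 8 kHz).

    A Hamming-windowed sinc low-pass filter with cutoff at a quarter of the
    input rate suppresses content that would alias, then every second sample
    is kept.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)
    h /= h.sum()  # unit gain at DC
    return np.convolve(x, h, mode="same")[::2]
```

In practice a resampling routine from an audio or signal-processing library would likely be preferable; the hand-written filter only keeps the sketch self-contained.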
“…In the first step, we maximize the attack effect on the ASV model which is similar to the method in [13]. We make δ be effective to lead a targeted attack on the ASV model regardless of the content of input x. N audios of the adversary are collected to form a training set X = {x_1, x_2, ..., x_N} where each x_i contains different text contents.…”
Section: Attack on the ASV Model
Citation type: mentioning
confidence: 99%
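
The idea in the statement above, a single perturbation δ trained over a set X = {x_1, ..., x_N} so that every x_i + δ is pushed toward a target speaker, can be illustrated with a toy example. This is only a sketch under strong assumptions: a linear softmax classifier with weights `W` stands in for the ASV model, projected gradient descent with an L-infinity bound `eps` is a swapped-in optimizer, and all names and hyperparameters are hypothetical. It is not the method of the cited paper or of [13].

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mean_target_loss(X, W, delta, target):
    """Mean cross-entropy of classifying every x_i + delta as `target`."""
    probs = softmax((X + delta) @ W)
    return float(-np.log(probs[:, target]).mean())


def train_universal_perturbation(X, W, target, eps=0.5, lr=0.01, steps=500):
    """Fit one delta, shared by all rows of X, for a targeted attack.

    X: (N, D) inputs; W: (D, C) weights of the toy linear model (assumption).
    """
    delta = np.zeros(X.shape[1])
    onehot = np.zeros(W.shape[1])
    onehot[target] = 1.0
    for _ in range(steps):
        probs = softmax((X + delta) @ W)               # (N, C)
        grad = ((probs - onehot) @ W.T).mean(axis=0)   # d(mean loss)/d(delta)
        delta = np.clip(delta - lr * grad, -eps, eps)  # step, then project
    return delta


# Usage on synthetic data (random stand-ins for audio and model weights).
rng = np.random.default_rng(0)
X_toy = 0.1 * rng.normal(size=(20, 16))
W_toy = rng.normal(size=(16, 4))
delta_toy = train_universal_perturbation(X_toy, W_toy, target=2)
```

Because δ is averaged over all training inputs rather than fitted to one, it does not depend on the content of any particular x_i, which is the property the statement describes.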
“…It provides a baseline WER to measure the distortion on speech recognition caused by the adversarial perturbations. As mentioned in Section 3.1, our first step is very similar to the method used in [13]. Hence we used the adversarial perturbation δ*_1 generated in the first step as the baseline method.…”
Section: Evaluation of Digital Attacks
Citation type: mentioning
confidence: 99%
“…The feasibility of inaudible audio attacks in speech recognition systems, such as Siri and Alexa, was proven in [25]. Speaker recognition systems based on deep neural networks are also vulnerable to adversarial attacks, as shown by the high attack success rate of over 90% [26].…”
Section: Related Work
Citation type: mentioning
confidence: 99%