Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems

Schönherr, Lea; Eisenhofer, Thorsten; Zeiler, Steffen; Holz, Thorsten; Kolossa, Dorothea

doi:10.48550/arxiv.1908.01551

Cited by 6 publications

(12 citation statements)

References 11 publications

“…In this work, we propose VENOMAVE, the first clean-label data poisoning attack against ASR systems. Other than current adversarial attacks on ASR systems [8,24,25] which target the system during inference (i.e., the attacker creates malicious input that causes a misclassification), data poisoning attacks target the system during the training phase. Such poisoning attacks were already shown to be viable against image classification, but to the best of our knowledge, no data poisoning attack was yet proposed against ASR systems.…”

Section: Methods (mentioning)

confidence: 99%

VenoMave: Targeted Poisoning Against Speech Recognition

Aghakhani, Schönherr, Eisenhofer et al. 2020

Preprint

Self Cite

In the past few years, we observed a wide adoption of practical systems that use Automatic Speech Recognition (ASR) systems to improve human-machine interaction. Modern ASR systems are based on neural networks and prior research demonstrated that these systems are susceptible to adversarial examples, i.e., malicious audio inputs that lead to misclassification by the victim's network during the system's run time. The research question if ASR systems are also vulnerable to data poisoning attacks is still unanswered. In such an attack, a manipulation happens during the training phase of the neural network: an adversary injects malicious inputs into the training set such that the neural network's integrity and performance are compromised. In this paper, we present the first data poisoning attack in the audio domain, called VENOMAVE. Prior work in the image domain demonstrated several types of data poisoning attacks, but they cannot be applied to the audio domain. The main challenge is that we need to attack a time series of inputs. To enforce a targeted misclassification in an ASR system, we need to carefully generate a specific sequence of disturbed inputs for the target utterance, which will eventually be decoded to the desired sequence of words. More specifically, the adversarial goal is to produce a series of misclassification tasks and in each of them, we need to poison the system to misrecognize each frame of the target file. To demonstrate the practical feasibility of our attack, we evaluate VENOMAVE on an ASR system that detects sequences of digits from 0 to 9. When poisoning only 0.94% of the dataset on average, we achieve an attack success rate of 83.33%. We conclude that data poisoning attacks against ASR systems represent a real threat that needs to be considered.

“…While ASR systems have become ever more reliable on clean data, they are still susceptible to malicious input, i.e. adversarial examples [2,8,24,25]. In these evasion attacks, a targeted audio file is perturbed by imperceptible amounts of adversarial noise at run time to trigger a misclassification of the victim's neural network.…”

Section: Introduction (mentioning)

confidence: 99%

VenoMave: Targeted Poisoning Against Speech Recognition

Aghakhani, Schönherr, Eisenhofer et al. 2020

Preprint

Self Cite

“…As discussed in Section II, transmission channel is a major concern when conducting physical attacks against ASR systems. During the past few years, many related works [5], [16], [20], [26], [28], [40], [65], [85], [95], [109], [134], [144] have emerged to enhance the robustness of the crafted audio adversarial examples in the physical space by exploiting transmission channel.…”

Section: A Targeting Transmission Channel (mentioning)

confidence: 99%

SoK: A Modularized Approach to Study the Security of Automatic Speech Recognition Systems

Chen, Zhang, Yuan et al. 2021

Preprint

With the wide use of Automatic Speech Recognition (ASR) in applications such as human machine interaction, simultaneous interpretation, audio transcription, etc., its security protection becomes increasingly important. Although recent studies have brought to light the weaknesses of popular ASR systems that enable out-of-band signal attack, adversarial attack, etc., and further proposed various remedies (signal smoothing, adversarial training, etc.), a systematic understanding of ASR security (both attacks and defenses) is still missing, especially on how realistic such threats are and how general existing protection could be. In this paper, we present our systematization of knowledge for ASR security and provide a comprehensive taxonomy for existing work based on a modularized workflow. More importantly, we align the research in this domain with that on security in Image Recognition System (IRS), which has been extensively studied, using the domain knowledge in the latter to help understand where we stand in the former. Generally, both IRS and ASR are perceptual systems. Their similarities allow us to systematically study existing literature in ASR security based on the spectrum of attacks and defense solutions proposed for IRS, and pinpoint the directions of more advanced attacks and the directions potentially leading to more effective protection in ASR. In contrast, their differences, especially the complexity of ASR compared with IRS, help us learn unique challenges and opportunities in ASR security. Particularly, our experimental study shows that transfer attacks across ASR models is feasible, even in the absence of knowledge about models (even their types) and training data.

“…This is referred to as a targeted attack and such an adversarial audio waveform may be 99.9% similar to a benign sample (Carlini & Wagner, 2018). Also, recent work (Schönherr et al, 2019; Qin et al, 2019; Yakura & Sakuma, 2018) has demonstrated the feasibility of these adversarial samples being played over-the-air by simulating room impulse responses and making them robust to reverberations. We observe that the key differentiation between generating adversarial examples across different tasks or input modalities such as images, audio or text lies in a change of architecture as these attacks generally attempt to maximize the training loss and it is valuable to study properties of adversarial examples that hold across multiple domains.…”

Section: Introduction (mentioning)

confidence: 99%

Identifying Audio Adversarial Examples via Anomalous Pattern Detection

Akinwande, Cintas, Speakman et al. 2020

Preprint

Audio processing models based on deep neural networks are susceptible to adversarial attacks even when the adversarial audio waveform is 99.9% similar to a benign sample. Given the wide application of DNN-based audio recognition systems, detecting the presence of adversarial examples is of high practical relevance. By applying anomalous pattern detection techniques in the activation space of these models, we show that 2 of the recent and current state-of-the-art adversarial attacks on audio processing systems systematically lead to higher-than-expected activation at some subset of nodes and we can detect these with up to an AUC of 0.98 with no degradation in performance on benign samples.
