Despite their immense popularity, deep learning-based acoustic systems are inherently vulnerable to adversarial attacks, wherein maliciously crafted audio triggers target systems to misbehave. In this paper, we present SA, a new class of attacks for generating adversarial audio. Compared with existing attacks, SA offers a set of significant features: (i) versatile: it is able to deceive a range of end-to-end acoustic systems under both white-box and black-box settings; (ii) effective: it is able to generate adversarial audio that is recognized as specific phrases by target acoustic systems; and (iii) stealthy: it is able to generate adversarial audio that is indistinguishable from its benign counterpart to human perception. We empirically evaluate SA on a set of state-of-the-art deep learning-based acoustic systems (covering speech command recognition, speaker recognition, and sound event classification), with results demonstrating the versatility, effectiveness, and stealthiness of SA. For instance, it achieves a 99.45% attack success rate on the IEMOCAP dataset against the ResNet18 model, while the generated adversarial audio is also misinterpreted by multiple popular ASR platforms, including Google Cloud Speech, Microsoft Bing Voice, and IBM Speech-to-Text. We further evaluate three potential defense methods for mitigating such attacks, which suggests promising directions for further research.