Perception-Aware Attack

Duan, Rui; Qu, Zhe; Zhao, Shangqing; Ding, Leah; Liu, Yao; Lu, Zhuo

doi:10.1145/3548606.3559350

Cited by 5 publications

(15 citation statements)

References 32 publications


“…where y_t ≠ y is the attacker's target label; Ω is the search space for δ; D(x, x + δ) is a distance function that measures the difference between the original speech x and the perturbed speech x + δ and can be the L_p norm based distance [29], [118] or a measure of auditory feature difference (e.g., qDev [44] and NISQA [113]); and ϵ limits the change from x to x + δ.…”

Section: B Adversarial Speech Attacks (mentioning)

confidence: 99%
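
The distance function D(x, x + δ) and the budget ϵ in the statement above can be illustrated with a minimal sketch. It assumes the L_p norm option for D and assumes the budget takes the common form D(x, x + δ) ≤ ϵ, which the excerpt does not spell out. The function names `lp_distance` and `within_budget` are hypothetical and do not come from the cited papers.

```python
import math
from typing import Sequence


def lp_distance(x: Sequence[float], x_adv: Sequence[float], p: float = 2.0) -> float:
    """L_p distance between original samples x and perturbed samples x_adv.

    Passing p=math.inf gives the L_inf (maximum absolute difference) distance.
    """
    if len(x) != len(x_adv):
        raise ValueError("signals must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, x_adv)]
    if math.isinf(p):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def within_budget(x: Sequence[float], delta: Sequence[float],
                  epsilon: float, p: float = 2.0) -> bool:
    """Check the assumed constraint D(x, x + delta) <= epsilon for an L_p distance."""
    x_adv = [a + d for a, d in zip(x, delta)]
    return lp_distance(x, x_adv, p) <= epsilon
```

For example, `within_budget([0.0, 0.0], [0.003, 0.004], epsilon=0.01)` checks whether a perturbation with L_2 norm of roughly 0.005 stays inside a budget of 0.01. A perceptual metric such as qDev or NISQA would replace `lp_distance` with a learned or regression-based score, which this sketch does not attempt to reproduce.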

“…We first need to find an appropriate perception metric to accurately measure the human perceptual quality of AEs based on different carriers. Recent studies [44], [113] have pointed out that traditional metrics, such as signal-to-noise ratio (SNR) [32] and the L_p norm [114], [29], [118], cannot directly reflect the human perception. They have used different human study based metrics to measure the perceptual quality of AEs with certain types of carriers (i.e., qDev for music AEs in [44] and NISQA for feature-twisted AEs [113]).…”

Section: B Quantifying Perceptual Quality Of Speech AEs (mentioning)

confidence: 99%

“…Recent studies [44], [113] have pointed out that traditional metrics, such as signal-to-noise ratio (SNR) [32] and the L_p norm [114], [29], [118], cannot directly reflect the human perception. They have used different human study based metrics to measure the perceptual quality of AEs with certain types of carriers (i.e., qDev for music AEs in [44] and NISQA for feature-twisted AEs [113]). In addition, we also notice that the harmonics-to-noise ratio (HNR) [115] is a common metric adopted in speech science to measure the quality of a speech signal.…”

Section: B Quantifying Perceptual Quality Of Speech AEs (mentioning)

confidence: 99%

“…Evaluation of speech quality metrics: Next, we evaluate the accuracy of existing metrics to characterize the speech quality based on our human study results. We compare the metrics of L_2 and L_∞ norms [114], [29], [118], SCR (equivalent to SNR [32]), HNR [115], audio-feature-regression-based qDev [44], and DNN-based NISQA [113], [82]. Note that the qDev model [44] was originally trained using music instead of speech.…”

Section: Perceptual Quality Of Different Carriers (mentioning)

confidence: 99%
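
As a rough illustration of the simplest metric in the comparison above, the following sketch computes a signal-to-noise ratio in decibels between an original signal and an additive perturbation. It uses the textbook power-ratio definition of SNR and is not taken from the cited papers; the function name `snr_db` is hypothetical, and the other metrics listed (HNR, qDev, NISQA) are not covered here.

```python
import math
from typing import Sequence


def snr_db(signal: Sequence[float], perturbation: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB: 10 * log10(signal power / perturbation power)."""
    signal_power = sum(s * s for s in signal)
    noise_power = sum(n * n for n in perturbation)
    if noise_power == 0.0:
        return math.inf   # no perturbation at all
    if signal_power == 0.0:
        return -math.inf  # silent signal with a non-zero perturbation
    return 10.0 * math.log10(signal_power / noise_power)
```

A perturbation whose amplitude is one tenth of the signal's gives about 20 dB under this definition. As the quoted statements note, such a ratio does not directly reflect how a human listener perceives the perturbed speech.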

“…In particular, we summarize the carriers into the following major types: (i) noise carriers, which are the results of traditional methods [29], [118] during their search for the perturbation signals in the unrestricted L_p space. (ii) feature-twisted carriers that are perturbation signals generated by only varying the auditory features of the original signal itself [113], [44], [17], [30], (iii) environmental sound carriers that are produced by environmental sounds [39].…”

Section: Introduction (mentioning)

confidence: 99%


Parrot-Trained Adversarial Examples: Pushing the Practicality of Black-Box Audio Attacks against Speaker Recognition Models

Duan, Qu, Ding et al. 2024

Proceedings 2024 Network and Distributed System Security Symposium


Audio adversarial examples (AEs) have posed significant security challenges to real-world speaker recognition systems. Most black-box attacks still require certain information from the speaker recognition model to be effective (e.g., keeping probing and requiring the knowledge of similarity scores). This work aims to push the practicality of the black-box attacks by minimizing the attacker's knowledge about a target speaker recognition model. Although it is not feasible for an attacker to succeed with completely zero knowledge, we assume that the attacker only knows a short (or a few seconds) speech sample of a target speaker. Without any probing to gain further knowledge about the target model, we propose a new mechanism, called parrot training, to generate AEs against the target model. Motivated by recent advancements in voice conversion (VC), we propose to use the one short sentence knowledge to generate more synthetic speech samples that sound like the target speaker, called parrot speech. Then, we use these parrot speech samples to train a parrot-trained (PT) surrogate model for the attacker. Under a joint transferability and perception framework, we investigate different ways to generate AEs on the PT model (called PT-AEs) to ensure the PT-AEs can be generated with high transferability to a black-box target model with good human perceptual quality. Real-world experiments show that the resultant PT-AEs achieve the attack success rates of 45.8%-80.8% against the open-source models in the digital-line scenario and 47.9%-58.3% against smart devices, including Apple HomePod (Siri), Amazon Echo, and Google Home, in the over-the-air scenario.


Section: B Adversarial Speech Attacks (mentioning)

confidence: 99%