Although speaker recognition has made significant progress in recent years, its performance is known to deteriorate severely in the presence of strong background noise. Inspired by a recently proposed clean-frame selection approach, this work investigates a relatively elegant frame-weighting method for computing the Baum-Welch statistics of Gaussian mixture models (GMMs) in i-vector extraction. By introducing weighting parameters for the frames of enrollment and test utterances, the optimization problem is redefined and solved. New update rules are derived by incorporating the weights into the computation of the posterior probabilities, mean vectors, and covariance matrices of the GMM. Experiments on the Speakers in the Wild (SITW) database show that the proposed algorithm significantly improves the performance of i-vector-based speaker recognition systems in noisy environments. Compared with the GMM i-vector baseline, the equal error rate is reduced from 5.75 to 4.72 and the minimum value of the detection cost function (C_det^min) is reduced from 0.4825 to 0.4505. A slight but significant advantage is also observed over a method that adds a deep-neural-network feature enhancement front end. Index Terms: Gaussian mixture models, frame weighting, Baum-Welch statistics, i-vector, robust speaker recognition.
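The frame-weighting idea in the abstract above can be illustrated with a minimal sketch. It assumes a diagonal-covariance GMM and assumes that each frame weight simply scales that frame's contribution to the zeroth- and first-order Baum-Welch statistics; the abstract does not give the paper's actual update rules, so the function names and the form of the weighting here are illustrative only.

```python
import numpy as np

def gmm_posteriors(X, mix_weights, means, variances):
    """Component posteriors per frame for a diagonal-covariance GMM.

    X: (T, D) frames; mix_weights: (C,); means, variances: (C, D).
    """
    diff = X[:, None, :] - means[None, :, :]                # (T, C, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)               # (T, C)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (C,)
    )
    log_joint = np.log(mix_weights) + log_gauss
    # Subtract the row maximum before exponentiating for stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def weighted_bw_stats(X, frame_w, mix_weights, means, variances):
    """Frame-weighted zeroth- and first-order statistics.

    frame_w: (T,) non-negative weights (hypothetical; e.g. a per-frame
    "cleanliness" score). Assumed form: N_c = sum_t w_t * gamma_t(c),
    F_c = sum_t w_t * gamma_t(c) * x_t.
    """
    gamma = gmm_posteriors(X, mix_weights, means, variances)  # (T, C)
    wg = frame_w[:, None] * gamma
    N = wg.sum(axis=0)   # (C,)
    F = wg.T @ X         # (C, D)
    return N, F
```

Under this assumed form, setting every weight to 1 gives the ordinary Baum-Welch statistics, and setting a weight to 0 discards that frame, so hard clean-frame selection becomes a special case of soft weighting.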
Automatic speaker recognition is an important biometric authentication approach with emerging applications. However, recent research has shown that it is vulnerable to adversarial attacks. In this paper, we propose a new type of adversarial example: imperceptible adversarial samples for targeted attacks on black-box automatic speaker recognition systems. Waveform samples are created directly by solving an optimization problem with waveform inputs and outputs, which is more realistic in real-life scenarios. Inspired by auditory masking, a regularization term that adapts to the energy of the speech waveform is proposed for generating imperceptible adversarial perturbations. The optimization problems are then solved with a differential evolution algorithm in a black-box manner, which requires no knowledge of the internal configuration of the recognition systems. Experiments on two commonly used data sets, LibriSpeech and VoxCeleb, show that the proposed methods successfully perform targeted attacks on state-of-the-art speaker recognition systems while remaining imperceptible to human listeners. Given the high SNR and PESQ scores of the resulting adversarial samples, the proposed methods degrade the quality of the original signals less than several recently proposed methods, which supports the imperceptibility of the adversarial samples.
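A rough sketch of the black-box setup described in the abstract above follows. Everything here is an assumption for illustration: `score_fn` is a hypothetical stand-in for the target system's score for the target speaker, the penalty divides perturbation energy by local speech energy as one plausible reading of "adapting to the energy of the speech waveform", and the search is a plain DE/rand/1/bin loop rather than the authors' configuration.

```python
import numpy as np

def frame_energy(signal, frame_len):
    """Mean energy per non-overlapping frame (trailing samples dropped)."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)

def energy_adaptive_penalty(delta, x, frame_len=160, eps=1e-8):
    """Assumed regularizer: perturbation energy relative to speech energy.

    Louder frames tolerate larger perturbations, loosely mimicking masking.
    Requires len(x) >= frame_len.
    """
    ratio = frame_energy(delta, frame_len) / (frame_energy(x, frame_len) + eps)
    return float(np.mean(ratio))

def de_attack(score_fn, x, lam=1.0, pop_size=20, n_iter=50,
              F=0.5, CR=0.9, bound=0.01, seed=0):
    """Minimal differential evolution over waveform perturbations.

    Only calls score_fn on candidate waveforms (no gradients), so the
    target system is treated as a black box. pop_size must be >= 4.
    """
    rng = np.random.default_rng(seed)
    dim = len(x)

    def loss(delta):
        # Lower is better: high target score, low relative perturbation.
        return -score_fn(x + delta) + lam * energy_adaptive_penalty(delta, x)

    pop = rng.uniform(-bound, bound, size=(pop_size, dim))
    fit = np.array([loss(p) for p in pop])
    for _ in range(n_iter):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, size=3, replace=False)]
            mutant = np.clip(a + F * (b - c), -bound, bound)
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True   # keep at least one mutant gene
            trial = np.where(cross, mutant, pop[i])
            f_trial = loss(trial)
            if f_trial <= fit[i]:             # greedy selection
                pop[i], fit[i] = trial, f_trial
    best = int(np.argmin(fit))
    return pop[best], float(fit[best])
```

In a real attack, `score_fn` would wrap queries to the deployed recognizer, and the number of queries (`pop_size * (n_iter + 1)` in this sketch) would be the main cost.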
Unprotected automatic speaker verification (ASV) systems are vulnerable to a variety of spoofing attacks in which an attacker (adversary) impersonates a specific target user. It is common practice to use a spoofing countermeasure (CM) to improve the security of ASV systems and prevent illegal access. However, recent studies have shown that both ASV and CM systems are vulnerable to adversarial attacks. Previous research has mainly focused on adversarial attacks against a single ASV or CM system. In practical scenarios, however, ASV systems are typically deployed in conjunction with a CM. In this paper, we investigate attacking the tandem system of ASV and CM with adversarial examples. A joint objective function is designed to constrain the generation of adversarial examples, and the joint gradient of the ASV and CM systems is derived to generate them. The Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) are used to study the vulnerability of tandem verification systems to white-box adversarial attacks. With our attack, audio samples whose original labels are spoof or nontarget can be successfully accepted by the tandem system. Experimental results on the ASVspoof 2019 dataset show that the tandem system is vulnerable to the proposed attack.
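To make the joint-gradient idea in the abstract above concrete, here is a toy sketch. It assumes both the ASV and the CM are simple logistic scorers with hypothetical weight vectors `w_asv` and `w_cm`, so the gradient can be written by hand; the paper's actual models and joint objective are not specified in the abstract, and the balance term `alpha` is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_accept_logprob(x, w_asv, w_cm, alpha=1.0):
    """Assumed joint objective: log P(ASV accepts) + alpha * log P(CM accepts)."""
    return np.log(sigmoid(w_asv @ x)) + alpha * np.log(sigmoid(w_cm @ x))

def joint_grad(x, w_asv, w_cm, alpha=1.0):
    """Gradient of the joint objective with respect to the input x.

    Uses d/dz log(sigmoid(z)) = 1 - sigmoid(z) for each toy scorer.
    """
    g_asv = (1.0 - sigmoid(w_asv @ x)) * w_asv
    g_cm = (1.0 - sigmoid(w_cm @ x)) * w_cm
    return g_asv + alpha * g_cm

def fgsm(x, w_asv, w_cm, eps, alpha=1.0):
    """Single signed-gradient step of size eps (FGSM)."""
    return x + eps * np.sign(joint_grad(x, w_asv, w_cm, alpha))

def pgd(x, w_asv, w_cm, eps, step, n_steps, alpha=1.0):
    """Iterated signed-gradient steps, projected onto the L-inf ball around x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(joint_grad(x_adv, w_asv, w_cm, alpha))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection step
    return x_adv
```

Because both terms share one input, a single perturbation is pushed toward acceptance by both subsystems at once, which is the point of attacking the tandem rather than each system separately. With real neural models, `joint_grad` would come from automatic differentiation instead of a closed form.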