The proliferation of speech technologies and rising privacy legislation call for the development of privacy preservation solutions for speech applications. These are essential since speech signals convey a wealth of rich, personal and potentially sensitive information. Anonymisation, the focus of the recent VoicePrivacy initiative, is one strategy to protect speaker identity information. Pseudonymisation solutions aim not only to mask the speaker identity and preserve the linguistic content, quality and naturalness, as is the goal of anonymisation, but also to preserve voice distinctiveness. Existing metrics for the assessment of anonymisation are ill-suited to this task, and metrics for the assessment of pseudonymisation are completely lacking. Based upon voice similarity matrices, this paper proposes the first intuitive visualisation of pseudonymisation performance for speech signals, together with two novel metrics for objective assessment. They reflect the two key pseudonymisation requirements of de-identification and voice distinctiveness.
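As an illustration only, the following Python sketch shows how a voice similarity matrix and diagonal-dominance-style statistics for de-identification and voice distinctiveness might be computed from per-speaker embeddings. It is a simplified assumption-based reading of the idea, not the paper's exact formulation: the cosine-similarity scoring and the helper names `similarity_matrix`, `diagonal_dominance`, `de_identification` and `gain_of_voice_distinctiveness` are illustrative choices.

```python
# Illustrative sketch (not the published definitions): a voice similarity
# matrix built from per-speaker embeddings, plus diagonal-dominance-style
# statistics capturing de-identification and voice distinctiveness.
import numpy as np

def similarity_matrix(emb_a, emb_b):
    """emb_a, emb_b: dicts {speaker_id: list of embedding vectors}.
    Entry (i, j) is the mean cosine similarity between speaker i's segments
    in set A and speaker j's segments in set B (cosine similarity is an
    assumption; verification scores could be used instead)."""
    spk_a, spk_b = sorted(emb_a), sorted(emb_b)
    M = np.zeros((len(spk_a), len(spk_b)))
    for i, sa in enumerate(spk_a):
        for j, sb in enumerate(spk_b):
            A = np.stack(emb_a[sa])
            B = np.stack(emb_b[sb])
            A = A / np.linalg.norm(A, axis=1, keepdims=True)
            B = B / np.linalg.norm(B, axis=1, keepdims=True)
            M[i, j] = (A @ B.T).mean()
    return M

def diagonal_dominance(M):
    """Gap between same-speaker (diagonal) and cross-speaker similarity."""
    n = M.shape[0]
    diag = np.trace(M) / n
    off = (M.sum() - np.trace(M)) / (n * (n - 1))
    return abs(diag - off)

def de_identification(M_oo, M_oa):
    """How much the original-vs-pseudonymised matrix M_oa loses its diagonal
    structure relative to the original-vs-original matrix M_oo (1 = perfect
    de-identification under this illustrative definition)."""
    return 1.0 - diagonal_dominance(M_oa) / diagonal_dominance(M_oo)

def gain_of_voice_distinctiveness(M_oo, M_pp):
    """How much diagonal structure survives within the pseudonymised-vs-
    pseudonymised matrix M_pp, expressed here as a ratio in dB."""
    return 10.0 * np.log10(diagonal_dominance(M_pp) / diagonal_dominance(M_oo))
```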
This paper describes two comparative studies of voice quality assessment based on complementary approaches. The first study was undertaken on 449 speakers (including 391 dysphonic patients) whose voice quality was evaluated in parallel by perceptual judgment and by objective measurements on acoustic and aerodynamic data. Results showed that a nonlinear combination of 7 parameters allowed 82% of voice samples to be classified in the same grade as the jury. The second study concerns the adaptation of Automatic Speaker Recognition (ASR) techniques to pathological voice assessment. The system designed for this particular task relies on a GMM-based approach, which is the state of the art for ASR. Experiments conducted on 80 female voices provide promising results, underlining the interest of such an approach. We take advantage of the multiplicity of these techniques to assess the methodological situation, which points to fundamental differences between these complementary approaches (bottom-up vs. top-down, global vs. analytic). We also discuss some theoretical aspects of the relationship between acoustic measurements and perceptual mechanisms, which are often forgotten in the race for performance.
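The Python sketch below illustrates the general spirit of a GMM-based grading system under stated assumptions, not the system described in the paper: one Gaussian mixture model per perceptual grade is trained on pooled acoustic frame features (e.g. MFCCs), and a test voice is assigned the grade whose model gives the highest average log-likelihood. The number of components, the feature choice and the helper names are illustrative assumptions.

```python
# Minimal sketch (assumed setup): GMM-based grading of voice samples in the
# spirit of GMM speaker recognition, with one mixture model per perceptual
# grade (e.g. the G0..G3 grades of a perceptual severity scale).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_grade_models(features_by_grade, n_components=16):
    """features_by_grade: dict {grade: array of shape (n_frames, n_features)}
    of acoustic frame features pooled over the training voices of that grade."""
    models = {}
    for grade, feats in features_by_grade.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(feats)
        models[grade] = gmm
    return models

def predict_grade(models, features):
    """Assign the grade whose GMM yields the highest average log-likelihood
    over the frames of the test voice."""
    scores = {grade: gmm.score(features) for grade, gmm in models.items()}
    return max(scores, key=scores.get)
```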
Acoustic noise is a major challenge for speaker recognition systems. State-of-the-art speaker recognition systems are based on deep neural network speaker embeddings, commonly known as x-vectors, and a noise-robust x-vector extractor is therefore highly desirable. In this paper, we introduce the Barlow Twins self-supervised loss function to the area of speaker recognition. The Barlow Twins objective function optimizes two criteria: first, it increases the similarity between two versions of the same signal (i.e. the clean signal and its augmented noisy version) to make the speaker embedding invariant to acoustic noise; second, it reduces the redundancy between the dimensions of the x-vectors, which improves the overall quality of the speaker embeddings. In our work, the Barlow Twins objective function is integrated into a ResNet-based speaker embedding system: it is computed on the embedding layer and optimized jointly with the speaker classifier loss function. Experimental results on the Fabiole corpus show a 22% relative gain in EER in clean conditions and an 18% improvement in the presence of low-SNR noise and reverberation.
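A minimal PyTorch sketch of such a joint objective is given below. It follows the two criteria described above, but the normalisation details and the weighting factors `off_diag_weight` and `lambda_bt` are assumptions for illustration rather than the paper's reported configuration.

```python
# Hedged sketch: a Barlow Twins term computed on the embeddings of a clean
# utterance and its noise-augmented copy, combined with a standard speaker
# classification loss (weights and normalisation are assumed, not reported).
import torch
import torch.nn.functional as F

def barlow_twins_loss(z_clean, z_noisy, off_diag_weight=5e-3):
    """z_clean, z_noisy: (batch, dim) embeddings of the clean signal and its
    augmented noisy version, taken from the embedding layer."""
    n, _ = z_clean.shape
    # Standardise each embedding dimension across the batch.
    z1 = (z_clean - z_clean.mean(0)) / (z_clean.std(0) + 1e-6)
    z2 = (z_noisy - z_noisy.mean(0)) / (z_noisy.std(0) + 1e-6)
    # Cross-correlation matrix between the two views.
    c = (z1.T @ z2) / n
    # Invariance term: the diagonal should be 1 (clean and noisy views agree).
    on_diag = ((torch.diagonal(c) - 1.0) ** 2).sum()
    # Redundancy-reduction term: off-diagonal entries should be 0
    # (decorrelated embedding dimensions).
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + off_diag_weight * off_diag

def joint_loss(logits, speaker_labels, z_clean, z_noisy, lambda_bt=1.0):
    """Speaker classification loss optimised jointly with the Barlow Twins term."""
    return F.cross_entropy(logits, speaker_labels) + \
        lambda_bt * barlow_twins_loss(z_clean, z_noisy)
```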