AUTOVC is a zero-shot voice conversion method that performs self-reconstruction using an autoencoder structure. It is simple to train because it uses only the autoencoder loss. However, it disentangles speaker and linguistic information by adjusting the bottleneck dimension; this requires meticulous tuning of the bottleneck dimension and involves a tradeoff between speech quality and speaker similarity. To address these issues, neural analysis and synthesis (NANSY), a fully self-supervised learning system that extracts speech features using perturbations, was proposed. NANSY avoids bottleneck-dimension tuning by exploiting perturbation and achieves high reconstruction performance. In this study, we propose Perturbation AUTOVC, a voice conversion method that combines the structure of AUTOVC with the perturbation of NANSY. The proposed method applies perturbations to the input speech, as in NANSY, to avoid the bottleneck-tuning problem of bottleneck-based voice conversion methods. Perturbation removes the speaker-dependent information in the speech, leaving only the linguistic information, which the content encoder then models as a content embedding. For speaker information, we use x-vectors extracted from a pretrained speaker recognition model, which are widely used as speaker embeddings. The linguistic and speaker embeddings produced by the encoders are concatenated with additional energy information and fed to the decoder for self-reconstruction. Like AUTOVC, the proposed method is trained simply, using only the autoencoder loss. For evaluation, we measured three objective metrics, character error rate (%), cosine similarity, and short-time objective intelligibility, as well as a subjective metric, the mean opinion score. The experimental results show that the proposed method outperforms other voice conversion methods and remains robust in zero-shot conversion.
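As a rough illustration of the perturbation step, NANSY-style formant shifting and pitch randomization are commonly implemented with Praat's "Change gender" command via the parselmouth package. The sketch below follows that convention; the parameter ranges are our own assumptions, not the paper's settings.

```python
import numpy as np
import parselmouth
from parselmouth.praat import call

def perturb(wav: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Remove speaker-dependent cues (formants, pitch range) while keeping
    linguistic content. Ratio ranges are illustrative assumptions."""
    snd = parselmouth.Sound(wav.astype(np.float64), sampling_frequency=sr)
    # Random warp factors, applied up or down with equal probability.
    formant_shift = rng.uniform(1.0, 1.4) ** rng.choice([-1, 1])
    pitch_range = rng.uniform(1.0, 1.5) ** rng.choice([-1, 1])
    # Praat's "Change gender" shifts formants and rescales the pitch range in
    # one call; a new pitch median of 0.0 keeps the original median.
    out = call(snd, "Change gender", 75, 600, formant_shift, 0.0, pitch_range, 1.0)
    return out.values.squeeze(0).astype(np.float32)
```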
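The self-reconstruction training described above can be sketched schematically in PyTorch. Only the data flow follows the description (perturbed input into the content encoder; content, x-vector, and energy concatenated into the decoder; a single autoencoder loss); the module types, dimensions, and names are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PerturbAutoVC(nn.Module):
    """Encoder-decoder in the AUTOVC layout; layer choices are illustrative."""

    def __init__(self, n_mels=80, d_content=64, d_speaker=512, d_energy=1):
        super().__init__()
        # Content encoder sees only the *perturbed* mel, so speaker cues are absent.
        self.content_encoder = nn.LSTM(n_mels, d_content, num_layers=2,
                                       batch_first=True, bidirectional=True)
        # Decoder reconstructs the clean mel from content + x-vector + energy.
        self.decoder = nn.LSTM(2 * d_content + d_speaker + d_energy, 512,
                               num_layers=3, batch_first=True)
        self.proj = nn.Linear(512, n_mels)

    def forward(self, mel_perturbed, xvector, energy):
        content, _ = self.content_encoder(mel_perturbed)             # (B, T, 2*d_content)
        spk = xvector.unsqueeze(1).expand(-1, content.size(1), -1)   # broadcast over time
        dec_in = torch.cat([content, spk, energy], dim=-1)
        hidden, _ = self.decoder(dec_in)
        return self.proj(hidden)                                     # reconstructed mel

# One self-reconstruction step: only an autoencoder (here L1) loss is used.
model = PerturbAutoVC()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
mel, mel_perturbed = torch.randn(4, 128, 80), torch.randn(4, 128, 80)
xvec, energy = torch.randn(4, 512), torch.randn(4, 128, 1)
loss = nn.functional.l1_loss(model(mel_perturbed, xvec, energy), mel)
opt.zero_grad(); loss.backward(); opt.step()
```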
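For the objective metrics, cosine similarity between speaker embeddings and STOI can be computed as in the sketch below; the use of the pystoi package and the assumption of time-aligned, equal-length signals are ours.

```python
import numpy as np
from pystoi import stoi

def speaker_cosine(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two x-vectors (higher = more similar speakers)."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-8))

def intelligibility(ref: np.ndarray, converted: np.ndarray, sr: int) -> float:
    """STOI between a reference and the converted signal; both are truncated
    to a common length (a simplifying assumption in this sketch)."""
    n = min(len(ref), len(converted))
    return stoi(ref[:n], converted[:n], sr, extended=False)
```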