Speaker recognition performance degrades when the input audio signal is short. In this paper, we tackle speaker recognition from short speech data by coupling two proposed acoustic features: Bark-scaled Gauss Filter Cepstral Coefficients (BGCC) and Perceptual Wavelet Packet Entropy (PWPE). Our approach rests on the observation that BGCC and PWPE capture sufficient information on different speaker-discriminative aspects of speech, such as speech perception and time-frequency information representation, which increases feature diversity. A triplet dual attention mechanism (Triplet-DAM) couples the two features. Coupling allows the limited features of a short utterance to be reused, and the dual attention mechanism enhances the more discriminative features among them, thereby improving speaker recognition on short-duration audio signals. Extensive analysis on a variety of datasets, whose speech samples differ in type and length, demonstrates the superiority of the proposed features and method over existing acoustic features (MFCCs and LPCCs) and existing speaker recognition algorithms (GMM-UBM, iVector-PLDA, and ResCNN-triplet). The experimental results show that the proposed method achieves a notable improvement over existing approaches for short-duration speaker recognition.
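The abstract names a triplet-based mechanism but does not give its loss function. As a rough illustration only, the following sketch shows a generic hinge-style triplet loss on cosine similarity between speaker embeddings. The function names, the margin value, and the toy 3-dimensional embeddings are all assumptions made for illustration; this is not the Triplet-DAM formulation itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Generic triplet loss (assumed form, not taken from the paper).

    The loss is zero once the anchor is more similar to the positive
    (same speaker) than to the negative (different speaker) by at
    least `margin`; otherwise it grows with the violation.
    """
    return max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin)


# Hypothetical embeddings: anchor and positive share a speaker.
anchor = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]
negative = [0.0, 1.0, 0.0]

easy = triplet_loss(anchor, positive, negative)  # well-separated triplet
hard = triplet_loss(anchor, negative, positive)  # roles swapped: violated
```

In this toy case the well-separated triplet contributes no loss, while swapping the positive and negative roles yields a positive loss that training would push down.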
In recent years, text-independent speaker verification has remained an active research topic, especially when enrollment and/or test data are limited. In low-resource, few-shot speaker verification, the lack of sufficient training data makes models prone to overfitting and low recognition accuracy. We therefore propose a bidirectional sampling aggregation-based meta metric learning method, termed BSML, to address the low accuracy of speaker recognition in a low-resource environment with limited data. First, BSML performs feature enhancement in the feature extraction stage; second, a large number of similar and disjoint tasks are used to train the model to compare sample similarity; finally, in new tasks, unknown samples are identified by calculating sample similarity. Extensive experiments are conducted on a short-duration text-independent speaker verification dataset, with speech samples of diverse lengths, generated from the limited low-resource Uyghur data of THUYG-20. The experimental results show that the metric learning approach is effective in avoiding overfitting and improving generalization, yielding significant gains in few-shot, short-duration speaker verification for low-resource Uyghur. They also show that BSML outperforms state-of-the-art deep embedding speaker recognition architectures and a recent metric learning approach by at least 18%-67% on the few-shot test set. The ablation experiments further show that the proposed approach achieves substantial improvement over prior methods, with better performance and generalization ability.
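The final step described above, identifying unknown samples by similarity, can be illustrated with a generic prototype-based few-shot classifier. This is a minimal sketch under stated assumptions: the averaging of support embeddings into a prototype, the use of cosine similarity, the speaker labels, and the 2-dimensional embeddings are all hypothetical, and the bidirectional sampling aggregation of BSML is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prototype(embeddings):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def classify(query, support):
    """Return the speaker whose prototype is most similar to the query.

    `support` maps a speaker label to its few enrolled embeddings
    (the "shots" of one few-shot task).
    """
    protos = {spk: prototype(embs) for spk, embs in support.items()}
    return max(protos, key=lambda spk: cosine(query, protos[spk]))


# Hypothetical 2-way, 2-shot task with toy embeddings.
support = {
    "spk_a": [[1.0, 0.1], [0.9, 0.0]],
    "spk_b": [[0.0, 1.0], [0.1, 0.9]],
}
pred = classify([0.8, 0.2], support)
```

Because the decision depends only on similarities computed inside each task, the same procedure applies to speakers never seen during training, which is the property metric learning relies on in the few-shot setting.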