Robust features for text-independent speaker recognition with short utterances

Chakroun, Rania; Frikha, Mounir

doi:10.1007/s00521-020-04793-y

Cited by 14 publications

(7 citation statements)

References 62 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…However, the performance of the state-of-the-art speaker recognition systems dramatically deteriorates when short utterances are used for training/testing particularly with low SNR [ 29 , 30 ]. In spite of the fact that deep neural networks (DNNs) provide a state-of-the-art tool for acoustic modeling, DNNs are data sensitive, and limited speech data as well as data-mismatch problems can deteriorate their performance [ 4 , 5 , 31 ].…”

Section: Related Workmentioning

confidence: 99%

“…However, deep learning systems require huge speech databases to be labeled and trained; theses databases also need to include phonetically rich sentences or at least phonetically balanced sentences [ 31 ]. In addition, most of speaker recognition systems that were developed based on deep learning techniques have been applied to text-dependent speaker verification tasks [ 4 ]. Hence, training deep learning systems on limited data is a difficult task and may not necessarily lead to speaker recognition systems with state-of-the-art performance.…”

Section: Related Workmentioning

confidence: 99%

“…A speaker identification system for social human–robot interaction, which is the main motivation of this research, should be able to extract a voice signature from unconstrained utterances while maintains its performance at challenging scenarios including various low signal-to-noise ratio and short length of utterance as well as different types of noise. However, most of the state-of-the-art speaker recognition systems, including those based on deep learning algorithms, have achieved significant performance on verification tasks which are not suitable for social human–robot interaction [ 4 , 5 , 6 ]. In speaker diarisation, which is a key feature for social robots, also known as “who spoke when”, a speech signal is partitioned into homogenous segments according to speaker identity [ 7 ].…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation

Al-Qaderi

Lahamer

Rad

2021

Sensors

View full text Add to dashboard Cite

We present a new architecture to address the challenges of speaker identification that arise in interaction of humans with social robots. Though deep learning systems have led to impressive performance in many speech applications, limited speech data at training stage and short utterances with background noise at test stage present challenges and are still open problems as no optimum solution has been reported to date. The proposed design employs a generative model namely the Gaussian mixture model (GMM) and a discriminative model—support vector machine (SVM) classifiers as well as prosodic features and short-term spectral features to concurrently classify a speaker’s gender and his/her identity. The proposed architecture works in a semi-sequential manner consisting of two stages: the first classifier exploits the prosodic features to determine the speaker’s gender which in turn is used with the short-term spectral features as inputs to the second classifier system in order to identify the speaker. The second classifier system employs two types of short-term spectral features; namely mel-frequency cepstral coefficients (MFCC) and gammatone frequency cepstral coefficients (GFCC) as well as gender information as inputs to two different classifiers (GMM and GMM supervector-based SVM) which in total leads to construction of four classifiers. The outputs from the second stage classifiers; namely GMM-MFCC maximum likelihood classifier (MLC), GMM-GFCC MLC, GMM-MFCC supervector SVM, and GMM-GFCC supervector SVM are fused at score level by the weighted Borda count approach. The weight factors are computed on the fly via Mamdani fuzzy inference system that its inputs are the signal to noise ratio and the length of utterance. Experimental evaluations suggest that the proposed architecture and the fusion framework are promising and can improve the recognition performance of the system in challenging environments where the signal-to-noise ratio is low, and the length of utterance is short; such scenarios often arise in social robot interactions with humans.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation

Al-Qaderi

Lahamer

Rad

2021

Sensors

View full text Add to dashboard Cite

show abstract

“…The ubiquitous noise masks or submerges the voice, changing the auditory, spectral, and other acoustic features. These changes increase the challenge of acquiring the target speaker's characteristics for speech processing, thereby reducing the performance of voice identity recognition [8]. Some methods have been extensively used to compensate mismatched voice identification in noisy environments, such as a multiscale chaotic feature [9], signal enhancement [10], the model compensation method [11], and score normalization [12].…”

Section: Introductionmentioning

confidence: 99%

Perceptual Characteristics of Voice Identification in Noisy Environments

2022

View full text Add to dashboard Cite

Auditory analysis is an essential method that is used to recognize voice identity in court investigations. However, noise will interfere with auditory perception. Based on this, we selected white noise, pink noise, and speech noise in order to design and conduct voice identity perception experiments. Meanwhile, we explored the impact of the noise type and frequency distribution on voice identity perception. The experimental results show the following: (1) in high signal-to-noise ratio (SNR) environments, there is no significant difference in the impact of noise types on voice identity perception; (2) in low SNR environments, the perceived result of speech noise is significantly different from that of white noise and pink noise, and the interference is more obvious; (3) in the speech noise with a low SNR (−8 dB), the voice information contained in the high-frequency band of 2930~6250 Hz is helpful for achieving accuracy in voice identity perception. These results show that voice identity perception in a better voice transmission environment is mainly based on the acoustic information provided by the low-frequency and medium-frequency bands, which concentrate most of the energy of the voice. As the SNR gradually decreases, a human’s auditory mechanism will automatically expand the receiving frequency range to obtain more effective acoustic information from the high-frequency band. Consequently, the high-frequency information ignored in the objective algorithm may be more robust with respect to identity perception in our environment. The experimental studies not only evaluate the quality of the case voice and control the voice recording environment, but also predict the accuracy of voice identity perception under noise interference. This research provides the theoretical basis and data support for applying voice identity perception in forensic science.

show abstract

“…Besides noise, the duration of utterances is a factor of high importance for the accuracy. [17] described the use of gammatone features in combination with i-vector on short utterances, [18] trained deep convolutional networks specifically on short utterances. [1] described the use of inception networks by transforming utterances to fixed length spectrograms and training via triplet loss.…”

Section: Introductionmentioning

confidence: 99%

Length- and Noise-aware Training Techniques for Short-utterance Speaker Recognition

Chen¹,

Huang²

2020

Preprint

View full text Add to dashboard Cite

Speaker recognition performance has been greatly improved with the emergence of deep learning. Deep neural networks show the capacity to effectively deal with impacts of noise and reverberation, making them attractive to far-field speaker recognition systems. The x-vector framework is a popular choice for generating speaker embeddings in recent literature due to its robust training mechanism and excellent performance in various test sets. In this paper, we start with early work on including invariant representation learning (IRL) to the loss function and modify the approach with centroid alignment (CA) and length variability cost (LVC) techniques to further improve robustness in noisy, far-field applications. This work mainly focuses on improvements for short-duration test utterances (1-8s). We also present improved results on long-duration tasks. In addition, this work discusses a novel self-attention mechanism. On the VOiCES far-field corpus, the combination of the proposed techniques achieves relative improvements of 7.0% for extremely short and 8.2% for full-duration test utterances on equal error rate (EER) over our baseline system.

show abstract

Robust features for text-independent speaker recognition with short utterances

Cited by 14 publications

References 62 publications

A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation

A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation

Perceptual Characteristics of Voice Identification in Noisy Environments

Length- and Noise-aware Training Techniques for Short-utterance Speaker Recognition

Contact Info

Product

Resources

About