2012
DOI: 10.1121/1.3697534
Spectro-temporal modulation energy based mask for robust speaker identification

Abstract: Spectro-temporal modulations of speech encode speech structures and speaker characteristics. An algorithm which distinguishes speech from non-speech based on spectro-temporal modulation energies is proposed and evaluated in robust text-independent closed-set speaker identification simulations using the TIMIT and GRID corpora. Simulation results show the proposed method produces much higher speaker identification rates in all signal-to-noise ratio (SNR) conditions than the baseline system using mel-frequency ce…
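The abstract's mask is built on spectro-temporal modulation energy. As an illustrative sketch only (not the paper's exact implementation, which is truncated above), the modulation energy of a log-spectrogram can be estimated with a 2-D FFT over local time-frequency patches: the temporal axis yields modulation rate and the spectral axis yields modulation scale. The function name, patch size, and step size below are assumptions.

```python
import numpy as np

def modulation_energy(log_spec, patch=(32, 32), step=(16, 16)):
    """Illustrative sketch: 2-D modulation energy of log-spectrogram patches.

    log_spec: (frequency, time) array of log-magnitude values.
    Returns one (patch[0], patch[1]) energy map per patch position.
    """
    F, T = log_spec.shape
    energies = []
    for f0 in range(0, F - patch[0] + 1, step[0]):
        for t0 in range(0, T - patch[1] + 1, step[1]):
            p = log_spec[f0:f0 + patch[0], t0:t0 + patch[1]]
            p = p - p.mean()  # remove DC so energy reflects modulations only
            # |2-D FFT|^2: spectral-modulation axis x temporal-modulation axis
            energies.append(np.abs(np.fft.fft2(p)) ** 2)
    return np.array(energies)
```

Speech tends to concentrate its energy at low modulation rates (a few Hz), which is what lets a modulation-energy criterion separate speech from many non-speech noises.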

Cited by 10 publications (4 citation statements)
References 18 publications
“…In [44], an algorithm which distinguishes speech from non-speech based on spectro-temporal modulation energies is proposed and evaluated in robust text-independent closed-set speaker identification simulations. An exemplar-based representation and sparse discrimination was proposed in [45] that outperformed the baseline GMM-universal background model (UBM) and HMM-based systems by a large margin.…”
Section: Performance Improvement
confidence: 99%
“…The digitized speech samples are converted into Mel-Frequency Cepstral Coefficients (MFCC) to produce feature vectors, which can then be used in learning algorithms such as SVM and ANN. Even though most of the known acoustic-property-based algorithms have achieved an almost 100% level of accuracy in an uncluttered environment [14], [15], the performance of these algorithms is degraded by channel variations due to the microphones used, or by environmental or background distortions [14].…”
Section: Introduction
confidence: 99%
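The MFCC pipeline the statement above alludes to (windowed frames, power spectrum, mel filterbank, DCT) can be sketched in a few lines of NumPy/SciPy. This is a minimal illustrative implementation, not the cited systems' code; the frame length, hop, and filterbank sizes are assumed defaults.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame -> power spectrum -> mel filterbank -> DCT."""
    # Frame the signal with a Hamming window.
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log energies; keep the low-order coefficients.
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

The resulting per-frame vectors (often averaged or pooled per utterance) are what an SVM or ANN classifier consumes.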
“…In general, the features from the speech signal for speaker recognition are extracted by modeling the human voice production system (such as the linear prediction cepstral coefficient, LPCC [5]) or the responses of the human auditory system. Human listeners are capable of recognizing speakers in noisy environments, while most traditional speaker recognition systems do not perform well in the presence of noise [6]. Unlike traditional methods in which features are extracted from the properties of the acoustic signal, this study proposes a speaker identification technique using neural responses from a physiologically-based computational model of the auditory periphery.…”
Section: Introduction
confidence: 99%
“…The traditional MFCC-based system achieved almost 100% classification accuracy in clean conditions [7, 8]. However, the performance of these acoustic-property-based methods degrades substantially for speech signals under channel variations induced by the handset or microphones, as well as under environmental or background distortions [6]. In recent years, efforts have been made to extract features that directly separate noise from speaker-characteristic information, such as cepstral mean normalization [9], RASTA processing [10], warping methods [11], and robust parameterizations [12].…”
Section: Introduction
confidence: 99%
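Of the robustness techniques named in the statement above, cepstral mean normalization is the simplest to illustrate: a stationary convolutive channel (e.g. a microphone response) appears as an additive constant in the cepstral domain, so subtracting the per-coefficient mean over time removes it. A minimal sketch (function name is an assumption):

```python
import numpy as np

def cepstral_mean_normalization(ceps):
    """Subtract the per-coefficient temporal mean from a (frames, coeffs)
    cepstral matrix. A fixed channel adds a constant offset to every
    frame's cepstrum, so mean subtraction cancels it."""
    return ceps - ceps.mean(axis=0, keepdims=True)
```

RASTA processing generalizes this idea by band-pass filtering each coefficient's trajectory rather than removing only its DC component.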