In this paper we analyze the advantages of using data mining techniques and tools for data fusion in forensic speaker recognition. Segmental and suprasegmental features were fed to 28 different classifiers in order to compare their performance. The selected classifiers use different learning techniques: lazy or instance-based, eager, and ensemble. Two approaches were employed for the classification task: the use of all features and the use of a feature subset selected with a gain ratio methodology. With all features, the best performance was obtained by three classifiers: Logistic Model Tree (eager), LogitBoost (ensemble) and Multilayer Perceptron (eager). Support Vector Machine (eager) proved to be a good classifier when a Pearson VII function-based universal kernel was used. When a low-dimensional feature subset was selected, ensemble classifiers outperformed all other classifiers. Segmental and tone features demonstrated the best speaker discrimination capabilities, followed by duration and voice quality features. Evaluation was performed on Argentine-Spanish voice samples from the Speech_Dat database, recorded in a fixed-telephone environment. Different recording sessions and channels were added for the test segments, and the Z-norm procedure was applied for channel compensation.
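The abstract does not spell out how Z-norm was computed. As a minimal sketch, assuming the usual definition of zero normalization (a raw score is shifted and scaled by the mean and standard deviation of the scores that a set of impostor utterances obtains against the same speaker model), the step could look as follows. The function name `z_norm` and the example scores are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def z_norm(raw_score, impostor_scores):
    """Normalize a raw score with impostor-score statistics (assumed Z-norm form).

    raw_score: score of the test segment against a speaker model.
    impostor_scores: scores of impostor utterances against the same model.
    """
    if len(impostor_scores) < 2:
        raise ValueError("need at least two impostor scores")
    mu = mean(impostor_scores)
    sigma = pstdev(impostor_scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("impostor scores have zero variance")
    return (raw_score - mu) / sigma


# Illustrative values only: five impostor scores for one speaker model.
impostors = [1.0, 2.0, 3.0, 4.0, 5.0]
normalized = z_norm(4.5, impostors)
```

After normalization, scores from different speaker models share a common scale, which is what allows a single decision threshold to be used across models.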
Due to the difficulty of recognizing very similar utterances, such as CV syllables with the same vowel, a two-step approach was proposed. In the first step, the normalized log energy, 32 spectral band log energies, and the spectral change were used in a dynamic programming (DP) algorithm to determine: (a) which of the five broad acoustic classes of the consonant involved (voiced stop, unvoiced stop, nasal, liquid, and fricative) the best-matched syllable fell into and (b) the warping functions. In the second step, the test pattern was compared with the reference patterns of the previously recognized class, emphasizing the differentiating regions in order to reach the final recognition of the syllable. The patterns were matched along the warping function, taking into account only the frames around the transitional region. The final distances were calculated using only the spectral bands that carry the distinctive features of the acoustic class. Speaker-dependent performance over the ten most frequent Spanish CV syllables improved from 78% to 99% with the two-step procedure, compared with taking the first-step result as the final recognition of the syllable.
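The abstract does not describe the DP recursion or its local constraints. The sketch below assumes a textbook dynamic time warping formulation with Euclidean frame distances and symmetric steps, and shows how an alignment cost and a warping function can be obtained together. The function name `dtw` and the toy one-dimensional frames are hypothetical; the original system used 34-dimensional frames (log energy, 32 band energies, spectral change) and may have used different path constraints.

```python
import math


def dtw(test, ref):
    """Align two sequences of feature frames; return (total cost, warping path).

    test, ref: non-empty lists of equal-dimension tuples of floats.
    The path is a list of (test_index, ref_index) pairs from start to end.
    """
    n, m = len(test), len(ref)
    if n == 0 or m == 0:
        raise ValueError("both patterns must contain at least one frame")
    # cost[i][j]: best accumulated distance aligning test[:i] with ref[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(test[i - 1], ref[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both patterns
                cost[i - 1][j],      # advance test only
                cost[i][j - 1],      # advance reference only
            )
    # Backtrack from the end to recover the warping function.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy example: the reference repeats its middle frame once.
total, path = dtw([(0.0,), (1.0,), (2.0,)], [(0.0,), (1.0,), (1.0,), (2.0,)])
```

In a second pass of the kind the abstract describes, the returned path could be reused to compare only the frame pairs near the consonant-vowel transition, rather than recomputing the alignment.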