Automatic Gender Recognition (AGR) is the task of identifying the gender of a speaker from a speech signal. Standard approaches extract features such as fundamental frequency and cepstral features from the speech signal and train a binary classifier. Inspired by recent work in automatic speech recognition (ASR), speaker recognition and presentation attack detection, we present a novel approach in which the relevant features and the classifier are jointly learned from the raw speech signal in an end-to-end manner. We propose a convolutional neural network (CNN) based gender classifier that consists of: (1) convolution layers, which can be interpreted as a feature learning stage, and (2) a multilayer perceptron (MLP), which can be interpreted as a classification stage. The system takes the raw speech signal as input and outputs gender posterior probabilities. Experimental studies conducted on two datasets, namely AVspoof and ASVspoof 2015, with different architectures show that, with simple architectures, the proposed approach yields a better system than the standard approach based on acoustic features. Further analysis shows that the CNNs learn formant and fundamental frequency information for gender identification.
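The two-stage structure described in this abstract (convolution layers as feature learning, an MLP as classification, raw waveform in, two posteriors out) can be illustrated with a minimal NumPy forward pass. This is only a sketch under stated assumptions, not the authors' architecture: the filter counts, kernel widths, strides, pooling, time-averaging step, random untrained weights, and the 4000-sample input are all hypothetical choices made for illustration.

```python
import numpy as np

def conv1d(x, kernels, stride):
    """Valid 1-D cross-correlation. x: (channels, time), kernels: (filters, channels, width)."""
    n_filters, _, width = kernels.shape
    n_out = (x.shape[1] - width) // stride + 1
    out = np.empty((n_filters, n_out))
    for t in range(n_out):
        window = x[:, t * stride : t * stride + width]            # (channels, width)
        out[:, t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

def max_pool(x, size):
    """Non-overlapping max pooling along the time axis."""
    n = x.shape[1] // size
    return x[:, : n * size].reshape(x.shape[0], n, size).max(axis=2)

def gender_posteriors(wave, params):
    """Map a raw waveform (1-D array) to two class posteriors."""
    h = wave[None, :]                                             # one input channel
    # Feature learning stage: convolution -> ReLU -> max pooling.
    for kernels, stride in params["conv"]:
        h = max_pool(np.maximum(conv1d(h, kernels, stride), 0.0), 2)
    feat = h.mean(axis=1)              # average over time (an assumption of this sketch)
    # Classification stage: one-hidden-layer MLP followed by a softmax.
    hidden = np.maximum(params["w1"] @ feat + params["b1"], 0.0)
    logits = params["w2"] @ hidden + params["b2"]
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
params = {
    # Hypothetical sizes: 8 filters of width 30 (stride 10), then 16 filters of width 7.
    "conv": [(rng.normal(0, 0.1, (8, 1, 30)), 10),
             (rng.normal(0, 0.1, (16, 8, 7)), 1)],
    "w1": rng.normal(0, 0.1, (32, 16)), "b1": np.zeros(32),
    "w2": rng.normal(0, 0.1, (2, 32)),  "b2": np.zeros(2),
}
wave = rng.normal(size=4000)           # stand-in for a short raw speech segment
posteriors = gender_posteriors(wave, params)
print(posteriors)                      # two non-negative values summing to 1
```

With untrained weights the output carries no gender information; the point is only that no hand-crafted features (fundamental frequency, cepstra) appear anywhere between the waveform and the posteriors.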
Children's speech recognition based on short-term spectral features is a challenging task. One reason is that children's speech has a high fundamental frequency that is comparable to formant frequency values. Furthermore, as children grow, their vocal apparatus also undergoes changes. This makes it difficult to extract standard short-term spectral features reliably for speech recognition. In recent years, novel acoustic modeling methods have emerged that learn both the features and the phone classifier in an end-to-end manner from the raw speech signal. Through an investigation on the PF-STAR corpus, we show that children's speech recognition can be improved using end-to-end acoustic modeling methods.
In Bourlard and Kamp (Biol Cybern 59(4):291–294, 1988), it was theoretically proven that autoencoders (AE) with a single hidden layer (previously called “auto-associative multilayer perceptrons”) were, in the best case, implementing singular value decomposition (SVD) Golub and Reinsch (Linear algebra, Singular value decomposition and least squares solutions, pp 134–151. Springer, 1971), equivalent to principal component analysis (PCA) Hotelling (Educ Psychol 24(6/7):417–441, 1933); Jolliffe (Principal component analysis, Springer series in statistics, 2nd edn. Springer, New York). That is, AE are able to derive the eigenvalues that represent the amount of variance covered by each component, even in the presence of nonlinear functions (sigmoid-like or any other) on their hidden units. Today, with the renewed interest in “deep neural networks” (DNN), multiple types of (deep) AE are being investigated as an alternative to manifold learning Cayton (Univ California San Diego Tech Rep 12(1–17):1, 2005) for conducting nonlinear feature extraction or fusion, each with its own specific (expected) properties. Many of those AE are currently being developed as powerful, nonlinear encoder–decoder models, or used to generate reduced and discriminant feature sets that are more amenable to different modeling and classification tasks. In this paper, we start by recalling and further clarifying the main conclusions of Bourlard and Kamp (Biol Cybern 59(4):291–294, 1988), supporting them with extensive empirical evidence, which could not be provided at the time (in 1988) because of dataset and processing limitations. Building on a full understanding of the underlying mechanisms, we show that it remains hard (although feasible) to go beyond the state-of-the-art PCA/SVD techniques for auto-association. Finally, we present a brief overview of the different autoencoder models that are mainly in use today and discuss their rationale, relations and application areas.
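The PCA/SVD baseline that this abstract refers to can be made concrete with a small numerical check. The sketch below covers only the linear case, as an illustration rather than a reproduction of the paper's experiments: it uses synthetic data, an assumed hidden-layer size `p = 3`, and a tied-weight linear autoencoder whose weights are set directly to the top principal directions instead of being trained. It verifies that such an autoencoder reproduces the truncated-SVD reconstruction, that the PCA eigenvalues are the squared singular values scaled by 1/(n − 1), and that another projection of the same rank does not achieve a lower reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # synthetic correlated data
X -= X.mean(axis=0)                                         # centre, as PCA assumes
p = 3                                                       # hidden units (assumed)

# Truncated SVD: best rank-p approximation in the least-squares sense.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = (U[:, :p] * s[:p]) @ Vt[:p]

# Linear autoencoder with tied weights fixed to the top-p principal directions:
# encode with W, decode with W.T.
W = Vt[:p].T                       # shape (10, p)
X_ae = (X @ W) @ W.T

# PCA eigenvalues of the sample covariance, sorted in descending order.
eigvals = np.linalg.eigvalsh(X.T @ X / (len(X) - 1))[::-1]

# Reconstruction error of the optimum versus a random rank-p orthonormal projection.
Q, _ = np.linalg.qr(rng.normal(size=(10, p)))
err_opt = np.sum((X - X_ae) ** 2)
err_rand = np.sum((X - (X @ Q) @ Q.T) ** 2)

print(np.allclose(X_ae, X_svd))                    # same reconstruction
print(np.allclose(eigvals, s**2 / (len(X) - 1)))   # eigenvalues match singular values
print(err_opt <= err_rand)                         # no worse than the random projection
```

The residual `err_opt` equals the sum of the discarded squared singular values, which is the floor any rank-`p` linear auto-association can reach.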