In this paper, we propose to use Convolutional Restricted Boltzmann Machine (ConvRBM) to learn filterbank from the raw audio signals. ConvRBM is a generative model trained in an unsupervised way to model the audio signals of arbitrary lengths. ConvRBM is trained using annealed dropout technique and parameters are optimized using Adam optimization. The subband filters of ConvRBM learned from the ESC-50 database resemble Fourier basis in the mid-frequency range while some of the low-frequency subband filters resemble Gammatone basis. The auditory-like filterbank scale is nonlinear w.r.t. the center frequencies of the subband filters and follows the standard auditory scales. We have used our proposed model as a front-end for the Environmental Sound Classification (ESC) task with supervised Convolutional Neural Network (CNN) as a back-end. Using CNN classifier, the ConvRBM filterbank (ConvRBM-BANK) and its score-level fusion with the Mel filterbank energies (FBEs) gave an absolute improvement of 10.65 %, and 18.70 % in the classification accuracy, respectively, over FBEs alone on the ESC-50 database. This shows that the proposed ConvRBM filterbank also contains highly complementary information over the Mel filterbank, which is helpful in the ESC task.
Abstract-In this paper, we propose to use modified Gammatone filterbank with Teager Energy Operator (TEO) for environmental sound classification (ESC) task. TEO can track energy as a function of both amplitude and frequency of an audio signal. TEO is better for capturing energy variations in the signal that is produced by a real physical system, such as, environmental sounds that contain amplitude and frequency modulations. In proposed feature set, we have used Gammatone filterbank since it represents characteristics of human auditory processing. Here, we have used two classifiers, namely, Gaussian Mixture Model (GMM) using cepstral features, and Convolutional Neural Network (CNN) using spectral features. We performed experiments on two datasets, namely, ESC-50, and UrbanSound8K. We compared TEO-based coefficients with Mel filter cepstral coefficients (MFCC) and Gammatone cepstral coefficients (GTCC), in which GTCC used mean square energy. Using GMM, the proposed TEO-based Gammatone Cepstral Coefficients (TEO-GTCC), and its score-level fusion with MFCC gave absolute improvement of 0.45 %, and 3.85 % in classification accuracy over MFCC on ESC-50 dataset. Similarly, on UrbanSound8K dataset the proposed TEO-GTCC, and its scorelevel fusion with GTCC gave absolute improvement of 1.40 %, and 2.44 % in classification accuracy over MFCC. Using CNN, the score-level fusion of Gammatone spectral coefficient (GTSC) and the proposed TEO-based Gammatone spectral coefficients (TEO-GTSC) gave absolute improvement of 14.10 %, and 14.52 % in classification accuracy over Mel filterbank energies (FBE) on ESC-50 and UrbanSond8K datasets, respectively. This shows that proposed TEO-based Gammatone features contain complementary information which is helpful in ESC task.
In recent years, automatic speaker verification (ASV) is used extensively for voice biometrics. This leads to an increased interest to secure these voice biometric systems for real-world applications. The ASV systems are vulnerable to various kinds of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins, and impersonation. This paper provides the literature review of ASV spoof detection, novel acoustic feature representations, deep learning, end-to-end systems, etc. Furthermore, the paper also summaries previous studies of spoofing attacks with emphasis on SS, VC, and replay along with recent efforts to develop countermeasures for spoof speech detection (SSD) task. The limitations and challenges of SSD task are also presented. While several countermeasures were reported in the literature, they are mostly validated on a particular database, furthermore, their performance is far from perfect. The security of voice biometrics systems against spoofing attacks remains a challenging topic. This paper is based on a tutorial presented at APSIPA Annual Summit and Conference 2017 to serve as a quick start for those interested in the topic.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.