The task of personalized keyword detection system which also performs text dependent speaker verification (TDSV) has received substantial interest recently. Conventional approaches to this task involve the development of the TDSV and wakeup-word detection systems separately. In this paper, we show that TDSV and keyword spotting (KWS) can be jointly modeled using the convolutional long short term memory (CLSTM) model architecture, where an initial convolutional feature map is further processed by a LSTM recurrent network. Given a small amount of training data for developing the CLSTM system, we show that the model provides accurate detection of the presence of the keyword in spoken utterance. For the TDSV task, the MTL model can be well regularized using the CLSTM training examples for personalized wake up task. The experiments are performed for KWS wake up detection and TDSV using the combined speech recordings from Wall Street Journal (WSJ) and LibriSpeech corpus. In these experiments with multiple keywords, we illustrate that the proposed approach of MTL significantly improves the performance of previously proposed neural network based text dependent SV systems. We also experimentally illustrate that the CLSTM model provides significant improvements over previously proposed keyword detection systems as well (average relative improvements of 30% over previous approaches).
The language recognition evaluation (LRE) 2017 challenge comprises an open evaluation of the language identification (LID) task on a set of 14 languages/dialects. In this paper, we describe our submission to the LRE 2017 challenge fixed condition which consisted of developing various LID systems using i-vector based modeling. The front end processing is performed using deep neural network (DNN) based bottleneck features for i-vector modeling with a Gaussian mixture model (GMM) universal background model (UBM) approach. Several back-end systems consisting of support vector machines (SVMs) and deep neural network (DNN) models were used for the language/dialect classification. The submission system achieved significant improvements over the evaluation baseline system provided by NIST (relative improvements of more than 50% over the baseline). In the later part of the paper, we detail our post evaluation efforts to improve the language recognition system for short duration speech data using novel approaches of sequence modeling of segment i-vectors. The post evaluation efforts resulted in further improvements over the submitted system (relative improvements of about 22 %). An error analysis is also presented which highlights the confusions and errors in the final system.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.