We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the firstpass. Our baseline is an acoustic model(AM), with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We add an autoregressive decoder network on top of the self-attention layers and jointly minimize the CTC loss on the encoder and the crossentropy loss on the decoder. This design yields further improvements over the baseline. We retrain all the models above in a multi-task learning(MTL) setting, where one branch of a shared network is trained as an AM, while the second branch classifies the whole sequence to be true-trigger or not. Results demonstrate that networks with self-attention layers yield ∼60% relative reduction in false reject rates for a given false-alarm rate, while requiring 10% fewer parameters. When trained in the MTL setup, self-attention networks yield further accuracy improvements. On-device measurements show that we observe 70% relative reduction in inference time. Additionally, the proposed network architectures are ∼5X faster to train.
We present a method to generate speech from input text and a style vector that is extracted from a reference speech signal in an unsupervised manner, i.e., no style annotation, such as speaker information, is required. Existing unsupervised methods, during training, generate speech by computing style from the corresponding ground truth sample and use a decoder to combine the style vector with the input text. Training the model in such a way leaks content information into the style vector. The decoder can use the leaked content and ignore some of the input text to minimize the reconstruction loss. At inference time, when the reference speech does not match the content input, the output may not contain all of the content of the input text. We refer to this problem as "content leakage", which we address by explicitly estimating and minimizing the mutual information between the style and the content through an adversarial training formulation. We call our method MIST -Mutual Information based Style Content Separation. The main goal of the method is to preserve the input content in the synthesized speech signal, which we measure by the word error rate (WER) and show substantial improvements over state-of-the-art unsupervised speech synthesis methods.
A conventional linear model based on Negentropy maximization extracts statistically independent latent variables which may not be optimal to give a discriminant model with good classification performance. In this paper, a single-stage linear semisupervised extraction of discriminative independent features is proposed. Discriminant independent component analysis (dICA) presents a framework of linearly projecting multivariate data to a lower dimension where the features are maximally discriminant with minimal redundancy. The optimization problem is formulated as the maximization of linear summation of Negentropy and weighted functional measure of classification. Motivated by independence among extracted features, Fisher linear discriminant is used as the functional measure of classification. Experimental results show improved classification performance when dICA features are used for recognition tasks in comparison to unsupervised (principal component analysis and ICA) and supervised feature extraction techniques like linear discriminant analysis (LDA), conditional ICA, and those based on information theoretic learning approaches. dICA features also give reduced data reconstruction error in comparison to LDA and ICA method based on Negentropy maximization.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.