Visual Voice Activity Detection (V-VAD) involves detecting the speech activity of a speaker using visual features. V-VAD is useful for detecting the end point of an utterance under noisy acoustic conditions or for preserving speaker privacy. In this paper, we propose a speaker-independent, real-time solution for V-VAD. The focus is on real-time operation and accuracy, since such algorithms play a key role in end-point detection, especially while interacting with speech assistants. We propose two novel methods, one using CNN features and the other using 2D-DCT features. Unidirectional LSTMs are used in both methods to enable online operation and to learn temporal dependence. The methods are evaluated on two publicly available datasets and, additionally, on a locally collected dataset, which further validates our hypothesis. Experiments show that both approaches generalize to unseen speakers, and our best approach gives a substantial improvement over earlier methods on the same dataset.
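A minimal sketch of the kind of causal CNN + unidirectional LSTM classifier the abstract describes, written in PyTorch. All layer sizes, the 64x64 mouth-crop input, and the class names here are illustrative assumptions, not the authors' published configuration.

```python
# Hypothetical CNN + unidirectional LSTM V-VAD sketch (PyTorch).
# Input: a batch of grayscale mouth-region clips (B, T, 1, 64, 64);
# output: per-frame speech / non-speech logits.
import torch
import torch.nn as nn

class CnnLstmVvad(nn.Module):
    def __init__(self, hidden_size=256):
        super().__init__()
        # Per-frame CNN encoder for the mouth region of interest.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        # A unidirectional LSTM keeps the model causal, so it can run online frame by frame.
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)  # speech vs. non-speech

    def forward(self, frames):
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out)  # (B, T, 2) per-frame logits

model = CnnLstmVvad()
logits = model(torch.randn(2, 25, 1, 64, 64))  # 2 clips of 25 frames each
print(logits.shape)  # torch.Size([2, 25, 2])
```

For the 2D-DCT variant, the CNN encoder would simply be replaced by a fixed per-frame 2D-DCT feature vector feeding the same LSTM; only the front end changes.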
Visual speech recognition, or lipreading, suffers from a high word error rate (WER) because it relies solely on the articulators visible to the camera. Recent works have mitigated this problem using complex deep neural network architectures. i-vector-based speaker adaptation is a well-known technique in ASR systems for reducing WER on unseen speakers. In this work, we explore speaker adaptation of lipreading models using latent identity vectors (visual i-vectors) obtained by factor analysis on visual features. To estimate the visual i-vectors, we collect sufficient statistics in two ways: first using a GMM-based universal background model (UBM), and second using an RNN-HMM-based UBM. The speaker-specific visual i-vector is given as an additional input to the hidden layers of the lipreading model during the training and test phases. On the GRID corpus, visual i-vectors yield 15% and 10% relative improvements over state-of-the-art lipreading architectures on unseen speakers using the RNN-HMM-based and GMM-based methods, respectively. Furthermore, we explore how the WER varies with the dimension of the visual i-vectors and with the amount of unseen-speaker data required for visual i-vector estimation. We also report results on a Korean visual corpus that we created.
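A minimal sketch, under stated assumptions, of how a per-speaker visual i-vector can be fed as an additional input to a lipreading model's hidden layers: the utterance-level i-vector is tiled across time and concatenated with the frame features before a recurrent layer. The GRU backbone, dimensions, and output vocabulary are illustrative choices, not the paper's exact architecture.

```python
# Hypothetical i-vector-conditioned lipreading sketch (PyTorch).
import torch
import torch.nn as nn

class IVectorLipreader(nn.Module):
    def __init__(self, feat_dim=128, ivec_dim=100, hidden=256, vocab=28):
        super().__init__()
        # The speaker-specific visual i-vector is appended to every frame's
        # features before the recurrent hidden layer.
        self.rnn = nn.GRU(feat_dim + ivec_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, ivector):
        # frame_feats: (B, T, feat_dim); ivector: (B, ivec_dim), constant per speaker.
        t = frame_feats.size(1)
        ivec = ivector.unsqueeze(1).expand(-1, t, -1)   # tile across time
        x = torch.cat([frame_feats, ivec], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # per-frame scores over output units

model = IVectorLipreader()
scores = model(torch.randn(4, 40, 128), torch.randn(4, 100))
print(scores.shape)  # torch.Size([4, 40, 28])
```

The same i-vector is used at train and test time, so adaptation to an unseen speaker only requires estimating that speaker's i-vector from enrollment video, not retraining the lipreading model.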