We propose variable-text text-dependent speaker-recognition systems based on the one-pass dynamic programming (DP) algorithm. The key feature of the proposed algorithm is its ability to use multiple templates for each of the words which form the "password" text. The use of multiple templates allows the proposed system to capture the idiosyncratic intra-speaker variability of a word, resulting in a significant improvement in performance. Our algorithm also uses inter-word silence templates to handle continuous speech input. We use the proposed one-pass DP algorithm in three speaker-recognition systems, namely, closed-set speaker identification (CSI), speaker verification (SV) and open-set speaker identification (OSI). These systems were evaluated on 100-speaker and 200-speaker tasks using the TIDIGITS database under various car-noise conditions. The key result of this paper is that the use of multiple templates enhances the performance of all three systems significantly: on the 100-speaker task, multiple templates (in comparison to a single template) improve CSI accuracy from 94% to 100%, reduce the SV EER from 1.6% to 0.09%, and reduce the OSI EER from 12.3% to 3.5%. We also use the proposed one-pass DP algorithm to automatically extract the multiple templates from continuous-speech training data. The performance of the three systems using such automatically extracted templates is as good as with manually extracted templates. Front-end noise suppression enables our systems to deliver robust performance in car noise down to 0 dB SNR.
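To make the word-level search concrete, the following is a minimal sketch of one-pass DP alignment over multiple templates per word. It is our illustration, not the authors' implementation: the Euclidean local distance, the stay/advance path constraints, and all names are assumptions; inter-word silence can be handled by including silence sequences as an additional "word" in the template dictionary.

```python
# Minimal sketch of one-pass DP with multiple templates per word
# (hypothetical names; stay/advance local path constraints assumed).
import numpy as np

def local_dist(x, y):
    """Euclidean distance between two feature frames (assumed local cost)."""
    return float(np.linalg.norm(x - y))

def one_pass_dp(frames, templates):
    """Align an utterance against any concatenation of word templates.

    frames    : list of feature vectors for the test utterance
    templates : dict mapping word -> list of reference frame sequences;
                a "silence" entry models inter-word silence
    Returns the minimum cumulative distance over all word sequences.
    """
    refs = [t for seqs in templates.values() for t in seqs]
    INF = float("inf")
    # D[r][j]: best cost of reaching frame j of reference r so far
    D = [np.full(len(t), INF) for t in refs]

    for i, x in enumerate(frames):
        # Cost of having just finished some word at the previous frame;
        # at i == 0 any word may start with zero entry cost.
        best_exit = 0.0 if i == 0 else min(col[-1] for col in D)
        new_D = []
        for t, col in zip(refs, D):
            nd = np.full(len(t), INF)
            for j in range(len(t)):
                pred = min(col[j], col[j - 1] if j > 0 else INF)
                if j == 0:                      # word-boundary transition
                    pred = min(pred, best_exit)
                nd[j] = local_dist(x, t[j]) + pred
            new_D.append(nd)
        D = new_D
    return min(col[-1] for col in D)
```

For closed-set identification, the test utterance would be scored with one_pass_dp against each enrolled speaker's template set and the speaker with the minimum cumulative distance chosen; verification and open-set identification would additionally compare this score to a threshold.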
Visual Speech Recognition aims to transcribe lip movements into readable text. Automatic speech recognition systems that combine audio and visual speech features have made great strides, recognizing words even under noisy conditions. Whereas a robust system uses visual features to support acoustic features, this paper focuses on the visual features alone. We propose classifying concatenated viseme (lip-movement) sequences into text, rather than the classic mapping of individual visemes. The results show that this approach achieves a significant improvement over state-of-the-art models. The system has two modules: the first extracts lip features from the input video, and the second is a neural network trained to process the viseme sequence and classify it as text.
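A minimal sketch of the second module is given below, assuming PyTorch; the viseme inventory size, layer dimensions, class count, and the LSTM architecture itself are illustrative placeholders rather than the paper's specification.

```python
# Sketch of a viseme-sequence-to-text classifier (all sizes are assumptions).
import torch
import torch.nn as nn

class VisemeSequenceClassifier(nn.Module):
    def __init__(self, n_visemes=14, embed_dim=32, hidden_dim=64, n_classes=100):
        super().__init__()
        self.embed = nn.Embedding(n_visemes, embed_dim)  # viseme id -> vector
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)      # text-class logits

    def forward(self, viseme_ids):
        # viseme_ids: (batch, seq_len) integer viseme indices from module 1
        x = self.embed(viseme_ids)
        _, (h, _) = self.rnn(x)     # h: (num_layers, batch, hidden_dim)
        return self.out(h[-1])      # classify from the final hidden state

# Example: classify a batch of 8 sequences of 20 viseme ids.
logits = VisemeSequenceClassifier()(torch.randint(0, 14, (8, 20)))
```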