Effective human-to-human communication involves both auditory and visual modalities, providing robustness and naturalness in realistic communication situations. Recent efforts at our lab are aimed at providing such multimodal capabilities for human-machine communication. Most of the visual modalities require a stable image of a speaker's face. In this paper we propose a connectionist face tracker that manipulates camera orientation and zoom to keep a person's face located at all times. The system operates in real time and can adapt rapidly to different lighting conditions, cameras, and faces, making it robust against environmental variability. Extensions and integration of the system with a multimodal interface will be presented.
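For illustration only, the sketch below shows one way a small connectionist network could drive a pan/tilt camera from a low-resolution frame: the network predicts a normalized face position, and the offset from the image center is converted into camera corrections. The layer sizes, gains, and function names (predict_face_position, camera_correction) are assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

# Illustrative sketch only: a tiny feed-forward network maps a downsampled
# grayscale camera frame to a normalized (x, y) face position, and the
# offset from the image center is turned into pan/tilt corrections.
# Layer sizes, gains, and names are hypothetical, not the paper's design.

rng = np.random.default_rng(0)
IN, HIDDEN = 20 * 15, 16                   # 20x15 low-resolution input image
W1 = rng.normal(0, 0.1, (HIDDEN, IN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (2, HIDDEN))       # two outputs: (x, y) in [0, 1]
b2 = np.zeros(2)

def predict_face_position(frame: np.ndarray) -> np.ndarray:
    """Forward pass: low-res frame -> normalized face coordinates."""
    x = frame.reshape(-1) / 255.0
    h = np.tanh(W1 @ x + b1)
    return 1 / (1 + np.exp(-(W2 @ h + b2)))   # sigmoid keeps output in [0, 1]

def camera_correction(pos: np.ndarray, gain: float = 10.0):
    """Convert offset from the image center into pan/tilt commands (degrees)."""
    dx, dy = pos - 0.5
    return gain * dx, gain * dy

frame = rng.integers(0, 256, (15, 20)).astype(float)   # stand-in camera frame
pan, tilt = camera_correction(predict_face_position(frame))
print(f"pan {pan:+.1f} deg, tilt {tilt:+.1f} deg")
```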
We present the development of a modular system for flexible human-computer interaction via speech. The speech recognition component integrates acoustic and visual information (automatic lip-reading), improving overall recognition accuracy, especially in noisy environments. The image of the lips, constituting the visual input, is automatically extracted from the camera picture of the speaker's face by the lip-locator module. Finally, the speaker's face is automatically acquired and followed by the face-tracker sub-system. Integration of the three functions results in the first bi-modal speech recognizer that allows the speaker reasonable freedom of movement within a possibly noisy room while continuing to communicate with the computer via voice. Compared to audio-alone recognition, the combined system achieves a 20 to 50 percent error-rate reduction across various signal-to-noise conditions.
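As a rough illustration of combining the two modalities, the sketch below concatenates time-aligned acoustic and lip-region feature streams into joint vectors for a single recognizer (early feature fusion). The feature dimensions and the function name fuse_frames are hypothetical; the paper's actual integration scheme may differ.

```python
import numpy as np

# Illustrative sketch of early audio-visual fusion: per-frame acoustic
# features are concatenated with features extracted from the tracked lip
# region, and the joint vector would feed a single classifier. Dimensions
# and names are stand-ins, not the paper's recognizer architecture.

def fuse_frames(acoustic: np.ndarray, visual: np.ndarray) -> np.ndarray:
    """Concatenate time-aligned acoustic and visual feature streams.

    acoustic: (T, Da) e.g. short-time spectral coefficients
    visual:   (T, Dv) e.g. intensity features from the lip region
    returns:  (T, Da + Dv) joint feature vectors
    """
    assert acoustic.shape[0] == visual.shape[0], "streams must be time-aligned"
    return np.concatenate([acoustic, visual], axis=1)

T = 100
audio_feats = np.random.randn(T, 16)    # stand-in acoustic features
lip_feats = np.random.randn(T, 25)      # stand-in 5x5 lip-region features
joint = fuse_frames(audio_feats, lip_feats)
print(joint.shape)                       # (100, 41)
```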
The purpose of this research is to determine how models of human auditory physiology can improve the performance of automatic speech recognition systems. In this study, a series of experiments was undertaken to discover how humans categorize and confuse vowels in natural speech. The recognition task comprised a large number of vowel nuclei isolated from naturally spoken sentences by many talkers. Machine vowel classifiers were trained to match the results of these vowel categorization experiments using two input feature representations: a spectral-energy representation and a representation derived from an auditory model. Classifiers trained on the auditory-model representation match human performance and are more robust to noise and spectral filtering than classifiers trained on the spectral-energy representation.
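The comparison described above could be set up roughly as follows: train the same simple classifier on each feature representation of the same vowel tokens, then score both on a noise-corrupted test set. The nearest-centroid classifier and the synthetic stand-in features below are placeholders for illustration, not the paper's auditory model or its actual classifiers.

```python
import numpy as np

# Illustrative evaluation procedure: the same simple classifier is trained
# on two feature representations of the same vowel tokens, then tested on
# noisy data. The synthetic features are arbitrary stand-ins; any accuracy
# difference here is baked into the toy data, not a result from the paper.

rng = np.random.default_rng(1)

def nearest_centroid_fit(X, y):
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(model, X):
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# Stand-in data: 3 vowel classes, two feature representations per token.
n, d = 300, 12
y = rng.integers(0, 3, n)
spectral = rng.normal(y[:, None], 1.0, (n, d))   # stand-in spectral-energy features
auditory = rng.normal(y[:, None], 0.5, (n, d))   # stand-in auditory-model features
                                                 # (made more separable purely for illustration)

for name, X in [("spectral-energy", spectral), ("auditory-model", auditory)]:
    model = nearest_centroid_fit(X[:200], y[:200])
    noisy_test = X[200:] + rng.normal(0, 1.0, X[200:].shape)   # additive noise
    acc = (nearest_centroid_predict(model, noisy_test) == y[200:]).mean()
    print(f"{name:>16s} accuracy under noise: {acc:.2f}")
```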