We present recent work on improving the performance of automated speech recognizers by using additional visual information (lip-/speechreading), achieving error reductions of up to 50%. This paper focuses on different methods of combining the visual and acoustic data to improve the recognition performance. We show this on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. We have developed adaptive combination methods at several levels of the recognition network. Additional information such as the estimated signal-to-noise ratio (SNR) is used in some cases. The results of the different combination methods are shown for clean speech and data with artificial noise (white, music, motor). The new combination methods adapt automatically to varying noise conditions, making hand-tuned parameters unnecessary.
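As a rough illustration of the SNR-dependent combination idea, the sketch below weights per-class acoustic and visual scores by a factor derived from the estimated SNR. It is a minimal sketch, not the paper's MS-TDNN: the linear SNR-to-weight mapping, the 0-30 dB range, and the function names are assumptions made for the example.

```python
import numpy as np

def snr_weight(snr_db, lo=0.0, hi=30.0):
    """Map an estimated SNR (dB) to an acoustic weight in [0, 1].
    The linear mapping and the 0-30 dB range are illustrative
    assumptions, not weights learned by the recognition network."""
    return float(np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0))

def combine_scores(acoustic_logp, visual_logp, snr_db):
    """Weighted log-linear combination of per-class scores from the
    acoustic and visual streams; higher SNR shifts weight to audio."""
    w = snr_weight(snr_db)
    return w * np.asarray(acoustic_logp) + (1.0 - w) * np.asarray(visual_logp)

# Example: at 5 dB SNR the visual stream dominates the combined score.
acoustic = np.log([0.2, 0.5, 0.3])   # hypothetical per-phone scores
visual   = np.log([0.6, 0.3, 0.1])
print(combine_scores(acoustic, visual, snr_db=5.0))
```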
We present the development of a modular system for flexible human-computer interaction via speech. The speech recognition component integrates acoustic and visual information (automatic lip-reading), improving overall recognition, especially in noisy environments. The image of the lips, constituting the visual input, is automatically extracted from the camera picture of the speaker's face by the lip locator module. Finally, the speaker's face is automatically acquired and followed by the face tracker sub-system. Integration of the three functions results in the first bi-modal speech recognizer allowing the speaker reasonable freedom of movement within a possibly noisy room while continuing to communicate with the computer via voice. Compared to audio-alone recognition, the combined system achieves a 20 to 50 percent error rate reduction for various signal/noise conditions.
In manual-cued speech (MCS) a speaker produces hand gestures to resolve ambiguities among speech elements that are often confused by speechreaders. The shape of the hand distinguishes among consonants; the position of the hand relative to the face distinguishes among vowels. Experienced receivers of MCS achieve nearly perfect reception of everyday connected speech. MCS has been taught to very young deaf children and greatly facilitates language learning, communication, and general education. This manuscript describes a system that can produce a form of cued speech automatically in real time and reports on its evaluation by trained receivers of MCS. Cues are derived by a hidden Markov model (HMM)-based speaker-dependent phonetic speech recognizer that uses context-dependent phone models and are presented visually by superimposing animated handshapes on the face of the talker. The benefit provided by these cues strongly depends on articulation of hand movements and on precise synchronization of the actions of the hands and the face. Using the system reported here, experienced cue receivers can recognize roughly two-thirds of the keywords in cued low-context sentences correctly, compared to roughly one-third by speechreading alone (SA). The practical significance of these improvements is to support fairly normal rates of reception of conversational speech, a task that is often difficult via SA.
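To make the cue-generation step concrete, here is a minimal sketch of mapping a recognized consonant-vowel pair to a handshape and a face position for display. The particular phone-to-cue assignments below are placeholders for illustration only; they do not reproduce the actual MCS cue chart or the system's phone inventory.

```python
# Hypothetical fragment of a phone-to-cue lookup: MCS uses a small set of
# handshapes for consonant groups and face positions for vowel groups;
# the specific assignments here are placeholders, not the real cue chart.
CONSONANT_HANDSHAPE = {"p": 1, "d": 1, "k": 2, "v": 2, "m": 5, "t": 5}
VOWEL_POSITION = {"ah": "side", "iy": "mouth", "uw": "chin", "eh": "throat"}

def cue_for_syllable(consonant, vowel):
    """Return the (handshape, position) cue to superimpose on the
    talker's face for a recognized consonant-vowel pair."""
    shape = CONSONANT_HANDSHAPE.get(consonant)
    position = VOWEL_POSITION.get(vowel, "side")  # neutral default position
    return shape, position

print(cue_for_syllable("k", "iy"))  # -> (2, 'mouth')
```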
In this paper, we present an overview of research in our laboratories on Multimodal Human Computer Interfaces. The goal for such interfaces is to free human computer interaction from the limitations and acceptance barriers due to rigid operating commands and keyboards as the only or main I/O devices. Instead we move to involve all available human communication modalities. These human modalities include Speech, Gesture and Pointing, Eye-Gaze, Lip Motion and Facial Expression, Handwriting, Face Recognition, Face Tracking, and Sound Localization.
An algorithm to simulate the effects of sensorineural hearing impairment on speech reception was investigated. Like that described by Villchur [J. Acoust. Soc. Am. 62, 665-674 (1977)], this simulation employs automatic gain control in independent frequency bands to reproduce the elevated audibility thresholds and loudness recruitment that are characteristic of this type of loss. In the present implementation, band gains are controlled in an effort to simulate loudness recruitment directly, using recruitment functions that depend only on the magnitude of hearing loss in the band. In a preliminary evaluation, two normal-hearing subjects listened to the simulation matched to hearing losses studied previously [Zurek and Delhorne, J. Acoust. Soc. Am. 82, 1548-1559 (1987)] with noise-masking simulations. This evaluation indicated that the present automatic gain control simulation yielded scores roughly similar to those of both the hearing-impaired listeners and the masked-normal listeners. In the more detailed evaluation, the performance of three listeners with severe sensorineural hearing loss on several speech intelligibility tests was compared to that of normal-hearing subjects listening to the output of the simulation. These tests included consonant-vowel syllable identification and sentence keyword identification for several combinations of speech-to-noise ratio, frequency-gain characteristic, and overall level. Generally, the simulation algorithm reproduced speech intelligibility well, though there was a clear trend for the simulation to result in better intelligibility than observed for impaired listeners when high-frequency emphasis placed more of the speech spectrum above threshold at higher frequencies. Also, the hearing-impaired listener with the greatest loss showed the largest discrepancies with the simulation. Overall, however, the simulation provides a very good approximation to speech reception by hearing-impaired listeners. The results of this study, together with previous studies of noise-masking simulation, suggest that threshold elevation and recruitment, which are necessary features of a simulation of cochlear hearing loss, can also be largely sufficient for simulating the speech-reception performance of listeners with moderate to severe hearing impairments.
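For readers unfamiliar with recruitment simulation, the sketch below computes a per-band gain from a simple expansion rule: band levels at or below the elevated threshold are made inaudible, and apparent loudness grows faster above it until it catches up with normal loudness at a high level. The linear-in-dB growth and the 100 dB catch-up point are simplifying assumptions for illustration, not the recruitment functions fitted in the study.

```python
import numpy as np

def recruitment_gain(band_level_db, loss_db, match_level_db=100.0):
    """Per-band gain (dB) for an expansion rule simulating threshold
    elevation and loudness recruitment.  Assumptions: loudness grows
    linearly in dB above the elevated threshold and matches normal
    loudness at `match_level_db`; these are not the paper's functions."""
    x = np.asarray(band_level_db, dtype=float)
    t = float(loss_db)
    # Simulated presentation level: inaudible at or below the elevated
    # threshold, rising to match_level_db at the catch-up point.
    y = np.where(x <= t, -np.inf,
                 match_level_db * (x - t) / (match_level_db - t))
    return y - x  # gain to apply to this band, in dB

# Example: with a 60 dB loss, a 70 dB band level is attenuated by 45 dB.
print(recruitment_gain(70.0, loss_db=60.0))
```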