This paper deals with the integration of visual data into automatic speech recognition systems. We first describe the framework of our research: the development of advanced multi-user, multi-modal interfaces. We then present audio-visual speech recognition problems in general, and the ones we are interested in, in particular. After a very brief discussion of existing systems, the major part of the paper describes the systems we developed according to two different approaches to the problem of integrating visual data into speech recognition systems. Section 3 presents the architecture of our audio-only reference and baseline systems. Our audio-visual systems are described in Section 4. We first describe a system we developed according to the first approach (called the direct integration model) and show its limitations. Our own approach, which we call asynchronous integration, is then presented in Section 4.2. After the general guidelines, we go into some detail about the distributed architecture and the variant of the N-best algorithm we developed for the implementation of this approach. In Section 6 the performances of these different systems are compared; we then conclude with a brief discussion of the performance improvements we obtain and of future work.
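The abstract does not detail the authors' N-best variant, but a common way to realize asynchronous integration is to let the audio decoder produce an N-best list of complete hypotheses and then re-rank that list with visual evidence computed independently. The following Python sketch illustrates that general idea only, not the paper's actual implementation; the Hypothesis class, the visual_score stub, and the LAMBDA weight are all names introduced here for illustration.

```python
# Illustrative sketch of N-best rescoring for asynchronous audio-visual
# integration. All names (Hypothesis, visual_score, LAMBDA) are assumed
# for this example and do not come from the paper.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: list[str]       # candidate word sequence from the audio decoder
    audio_logprob: float   # acoustic + language-model log score

# Relative weight of the visual evidence; in practice this would be
# tuned on held-out data.
LAMBDA = 0.3

def visual_score(words: list[str]) -> float:
    """Placeholder: log score of the lip-movement (visual) stream given
    the hypothesized word sequence. A real system would align the
    hypothesis against visual features extracted from the speaker's lips."""
    return 0.0  # stub for illustration

def rescore_nbest(nbest: list[Hypothesis]) -> list[str]:
    """Re-rank the audio decoder's N-best list with visual evidence.

    The visual scores apply to complete hypotheses rather than to
    individual frames, so the two modalities need not be synchronized
    at the feature level."""
    best = max(
        nbest,
        key=lambda h: (1 - LAMBDA) * h.audio_logprob
                      + LAMBDA * visual_score(h.words),
    )
    return best.words

# Example: two competing hypotheses from the audio-only decoder.
nbest = [
    Hypothesis(["bat"], audio_logprob=-4.1),
    Hypothesis(["pat"], audio_logprob=-4.3),
]
print(rescore_nbest(nbest))
```

Because each modality scores whole hypotheses at its own pace, this style of integration lends itself naturally to a distributed architecture of the kind the paper describes.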