Abstract. Automatic lipreading is automatic speech recognition that uses only visual information. The relevant data in the video signal is isolated and features are extracted from it. From a sequence of feature vectors, where every vector represents one video image, a sequence of higher-level semantic elements is formed. These semantic elements are "visemes", the visual equivalent of "phonemes". The developed prototype uses a Time-Delay Neural Network to classify the visemes.
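The abstract does not detail the network layout, so the following is only a minimal sketch of per-frame viseme classification with a Time-Delay Neural Network, realized here as dilated 1D convolutions over the time axis. The feature dimensionality (20), number of viseme classes (12), and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class TDNN(nn.Module):
    """Minimal Time-Delay Neural Network: 1D convolutions over the
    time axis of a sequence of per-frame feature vectors."""
    def __init__(self, feat_dim=20, num_visemes=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=3),        # context of 3 frames
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, dilation=2),  # wider temporal context
            nn.ReLU(),
            nn.Conv1d(64, num_visemes, kernel_size=1),     # per-frame class scores
        )

    def forward(self, x):
        # x: (batch, time, feat_dim) -> Conv1d expects (batch, feat_dim, time)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

# One 25-frame clip of 20-dimensional mouth-region features (made-up sizes).
frames = torch.randn(1, 25, 20)
scores = TDNN()(frames)           # (1, 19, 12): viseme scores per valid frame
print(scores.argmax(dim=-1))      # most likely viseme index per frame
```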
Within the Helium3D project, a wide spectrum of gesture-tracking aspects is investigated: near-field (3D) versus far-field (2D), proprietary sensors versus webcams, discrete versus continuous gestures, and data modeling to cover all types of gestures and use cases. This paper gives an overview of those aspects. A common aspect of the described tracking technologies is that they do not rely on cues that are likely to break down in the real world, such as skin-color filtering in image capture. The event model generalizes across trackers and applications, while at the same time offering a way to track persons as they move into or out of the scene.
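The abstract does not specify the event model's structure; the sketch below shows one plausible shape for such a tracker-independent model. The event names (PERSON_ENTERED, GESTURE, PERSON_LEFT) and fields are hypothetical, chosen to cover both the enter/leave lifecycle and discrete or continuous gesture samples from 2D or 3D sensors.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

class EventType(Enum):
    PERSON_ENTERED = auto()   # a person moved into the scene
    PERSON_LEFT = auto()      # a person moved out of the scene
    GESTURE = auto()          # a discrete or continuous gesture sample

@dataclass
class TrackingEvent:
    type: EventType
    person_id: int                                   # stable id while tracked
    gesture: Optional[str] = None                    # e.g. "point" (hypothetical)
    position: Optional[Tuple[float, float, float]] = None  # 3D near-field; z unused in far-field

def handle(event: TrackingEvent) -> None:
    # An application subscribes to events and stays independent of the
    # underlying sensor (webcam or proprietary, 2D or 3D).
    if event.type is EventType.PERSON_ENTERED:
        print(f"track person {event.person_id}")
    elif event.type is EventType.GESTURE:
        print(f"person {event.person_id}: {event.gesture} at {event.position}")
    elif event.type is EventType.PERSON_LEFT:
        print(f"lost person {event.person_id}")

handle(TrackingEvent(EventType.PERSON_ENTERED, person_id=1))
handle(TrackingEvent(EventType.GESTURE, 1, "point", (0.2, 0.5, 1.8)))
handle(TrackingEvent(EventType.PERSON_LEFT, person_id=1))
```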