1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, Conference Proceedings
DOI: 10.1109/icassp.1996.543246
Visual speech recognition using active shape models and hidden Markov models

Abstract: This paper describes a novel approach for visual speech recognition. The shape of the mouth is modelled by an Active Shape Model which is derived from the statistics of a training set and used to locate, track and parameterise the speaker's lip movements. The extracted parameters representing the lip shape are modelled as continuous probability distributions and their temporal dependencies are modelled by Hidden Markov Models. We present recognition tests performed on a database of a broad variety of speakers …
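The abstract's pipeline (ASM-derived lip-shape parameters scored against per-word continuous-density HMMs) can be sketched with a toy forward-algorithm recogniser. This is a minimal illustration, not the paper's implementation: all model parameters are invented, observations are single scalars rather than full shape-parameter vectors, and every function name here is ours.

```python
import math

def gaussian_logpdf(x, mean, var):
    # log N(x; mean, var) for a scalar observation
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def forward_loglik(obs, log_pi, log_A, means, vars_):
    """Log-likelihood of a scalar observation sequence under a
    continuous-density HMM, via the forward algorithm in log space."""
    n = len(log_pi)
    # initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [log_pi[i] + gaussian_logpdf(obs[0], means[i], vars_[i])
             for i in range(n)]
    for t in range(1, len(obs)):
        new = []
        for j in range(n):
            # log-sum-exp over predecessor states i
            terms = [alpha[i] + log_A[i][j] for i in range(n)]
            m = max(terms)
            lse = m + math.log(sum(math.exp(v - m) for v in terms))
            new.append(lse + gaussian_logpdf(obs[t], means[j], vars_[j]))
        alpha = new
    m = max(alpha)
    return m + math.log(sum(math.exp(v - m) for v in alpha))

LOG = math.log
# Two illustrative 2-state word models sharing the same topology:
pi = [LOG(0.9), LOG(0.1)]
A = [[LOG(0.7), LOG(0.3)], [LOG(0.1), LOG(0.9)]]
model_a = {"means": [0.0, 1.0], "vars": [0.2, 0.2]}
model_b = {"means": [4.0, 5.0], "vars": [0.2, 0.2]}

obs = [0.1, 0.4, 0.9, 1.1]  # drifts from state 1 toward state 2 of model A
score_a = forward_loglik(obs, pi, A, model_a["means"], model_a["vars"])
score_b = forward_loglik(obs, pi, A, model_b["means"], model_b["vars"])
# Recognition picks the word model with the higher log-likelihood;
# here the sequence clearly matches model A.
```

In the paper's setting the observations would be the ASM shape-parameter vectors extracted per frame, and one such HMM would be trained per word in the vocabulary.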


Cited by 78 publications (46 citation statements)
References 12 publications
“…An early approach that exploits the characteristic temporal signature of faces based on partially recurrent neural networks trained over sequences of facial images was first introduced by Gong, Psarrou, Katsoulis, and Palavouzis (1994). Initial experiments conducted by Luettin and colleagues suggested that spatiotemporal models (HMMs) trained on sequences of lip motion during speech could be useful for speaker recognition (Luettin, Thacker, & Beet, 1996). However, beyond this early experiment, the use of spatiotemporal cues for the identification of people in computer vision has remained relatively unexplored (Gong, McKenna, & Psarrou, 2000).…”
Section: Computer Vision Models for the Processing of Dynamic Faces
confidence: 99%
“…The first is a top-down approach, where an a priori lip-shape representation framework is embedded in a model; for example, active shape models (ASMs) [31] and active appearance models (AAMs) [11]. ASMs and AAMs extract higher-level, model-based features derived from the shape and appearance of mouth-area images.…”
Section: Visual Feature Extraction Mechanisms
confidence: 99%
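The ASM parameterisation this snippet refers to can be illustrated with a toy shape model: a flattened landmark vector x is approximated as the mean shape plus a weighted sum of orthonormal modes of variation, and the weight vector b is the compact, model-based feature. All numbers and names below are invented for illustration; a real model would learn the mean and modes from training data via PCA.

```python
# Toy ASM-style lip-shape parameterisation.
mean_shape = [0.0, 0.0, 1.0, 0.0, 0.5, 0.3]  # 3 landmarks as (x, y) pairs
# One orthonormal mode of variation (unit length):
modes = [[0.0, 0.5, 0.0, 0.5, 0.0, 0.7071067811865476]]

def project(shape, mean, modes):
    # For orthonormal modes, b_k is the dot product mode_k . (shape - mean).
    diff = [s - m for s, m in zip(shape, mean)]
    return [sum(p * d for p, d in zip(mode, diff)) for mode in modes]

def reconstruct(b, mean, modes):
    # Rebuild a shape vector from its model parameters b.
    out = list(mean)
    for coef, mode in zip(b, modes):
        out = [o + coef * p for o, p in zip(out, mode)]
    return out

observed = reconstruct([0.3], mean_shape, modes)  # a shape 0.3 along mode 0
b = project(observed, mean_shape, modes)          # ≈ [0.3]
```

Because b lives in a low-dimensional space constrained by the training-set statistics, it suppresses implausible shapes, which is the advantage of the top-down, model-based approach over bottom-up pixel features.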
“…However, these approaches cannot adapt to considerable geometric changes in the lips from frame to frame. In the model-based approach, the active shape model (6) and genetic snakes (8)(25), an improved version of Snakes (26), have been proposed. These approaches carry constraints: for example, the target image must be a face region in a fixed pose, or the subject must wear a camera to capture a mouth image.…”
Section: Related Work
confidence: 99%