Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96)
DOI: 10.1109/icslp.1996.607019

Audiovisual speech recognition using multiscale nonlinear image decomposition

Abstract: There has recently been increasing interest in the idea of enhancing speech recognition by the use of visual information derived from the face of the talker. This paper demonstrates the use of nonlinear image decomposition, in the form of a 'sieve', applied to the task of visual speech recognition. Information derived from the mouth region is used in visual and audiovisual speech recognition of a database of the letters A-Z for four talkers. A scale histogram is generated directly from the grayscale pixels of …
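As a rough illustration of the scale-histogram idea, the sketch below runs a 1-D morphological opening/closing cascade of increasing scale over each pixel column of a mouth-region image and accumulates the amplitude removed at each scale. This is a minimal sketch under our own assumptions (SciPy morphology as the filter, column-wise processing, illustrative sizes and names), not the authors' exact sieve.

import numpy as np
from scipy.ndimage import grey_opening, grey_closing

def scale_histogram(signal, max_scale=32):
    # Toy 1-D "sieve": at each scale s, an opening followed by a closing
    # removes peaks/troughs of extent <= s; the amplitude removed (the
    # "granules") is accumulated into bin s of the scale histogram.
    x = np.asarray(signal, dtype=float)
    hist = np.zeros(max_scale)
    for s in range(1, max_scale + 1):
        y = grey_closing(grey_opening(x, size=s + 1), size=s + 1)
        hist[s - 1] = np.abs(x - y).sum()  # detail removed at this scale
        x = y                              # cascade to the next scale
    return hist

# A per-frame feature vector: sum the histograms of all pixel columns
# of an (already tracked) mouth-region image.
frame = np.random.rand(48, 64)             # stand-in for a grayscale mouth image
features = sum(scale_histogram(frame[:, c]) for c in range(frame.shape[1]))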

Cited by 28 publications (19 citation statements) | References 16 publications (20 reference statements)
“…Such improvements have typically been demonstrated on databases of short duration, in most cases limited to a very small number of speakers (mostly fewer than ten, and often single-subject) and to small-vocabulary tasks [18], [21]. Common tasks include recognition of nonsense words [22], [23], isolated words [19], [24][25][26][27][28][29][30], connected digits [31], [32], letters [31], or closed-set sentences [33], mostly in English, but also in French [22], [34], [35], German [36], [37], and Japanese [38], among others. Recently, however, significant improvements have also been demonstrated for large-vocabulary continuous speech recognition (LVCSR) [39], as well as for speech degraded by speech impairment [40] or Lombard effects [29].…”
Section: Audio-Only ASR, Visual-Only ASR (Automatic Speechreading)
confidence: 99%
“…1). The first scenario is primarily useful for benchmarking the performance of visual feature extraction algorithms, with visual-only ASR results typically reported on small-vocabulary tasks [24], [25], [28][29][30][31], [36], [40][41][42][43], [46], [59], [66], [78], [84][85][86][87][88][89][90][91][92]. Visual speech modeling is required in this process, its two central aspects being the choice of speech classes assumed to generate the observed features, and the statistical modeling of this generation process.…”
Section: Visual Speech Modeling for ASR
confidence: 99%
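In practice, the statistical modeling referred to above is most often done with hidden Markov models: one HMM per speech class (word, letter, or viseme), trained on sequences of per-frame visual feature vectors, with classification by maximum likelihood. A minimal sketch, assuming Gaussian-emission HMMs via the hmmlearn library (class labels, state counts, and feature dimensions are illustrative, not from any cited paper):

import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_models(train_data, n_states=5):
    # train_data: {class_label: list of (T_i, D) feature-sequence arrays}.
    # One HMM per class, fit on that class's concatenated sequences.
    models = {}
    for label, seqs in train_data.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, seq):
    # Pick the class whose HMM assigns the test sequence the
    # highest log-likelihood.
    return max(models, key=lambda label: models[label].score(seq))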
“…The authors report correct recognition close to 90%. However, on the AVletters data corpus, Matthews et al. (Matthews et al., 1996) report only a 50% recognition rate. Li et al. (Li et al., 1995) report perfect (100%) recognition on the same task, but two years later (Li et al., 1997) only 90%.…”
Section: State of the Art in Lip Reading
confidence: 99%
“…In all cases, the first and probably most important step in building a data corpus is to carefully state the targeted applications of the system that will be trained on the dataset. Some of the most cited data corpora for lip reading are: TULIPS1 (Movellan, 1995), AVletters (Matthews et al., 1996), AVOZES (Goecke & Millar, 2004), CUAVE (Patterson et al., 2002), DAVID (Chibelushi et al., 1996), ViaVoice (Neti et al., 2000), DUTAVSC (Wojdel et al., 2002), AVICAR (Lee et al., 2004), AT&T (Potamianos et al., 1997), CMU (Zhang et al., 2002), XM2VTSDB (Messer et al., 1999), M2VTS (Pigeon & Vandendorpe, 1997), and LIUM-AVS (Daubias & Deleglise, 2003). With the exception of M2VTS, which is in French, XM2VTSDB, which is in four languages, and DUTAVSC, which is in Dutch, the rest are English-only (Table 1).…”
Section: On Building a Data Corpus for Lip Reading: A Comparison
confidence: 99%