Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96)
DOI: 10.1109/icslp.1996.607943

Using the visual component in automatic speech recognition

Abstract: The movements of talkers' faces are known to convey visual cues that can improve speech intelligibility, especially where there is noise or hearing impairment. This suggests that visible facial gestures could be exploited to enhance speech intelligibility in automatic systems. Handling the volume of data represented by images of talkers' faces implies some form of data compression. Rather than using conventional feature extraction approaches, image coding and compression can be achieved using data-driven, stati…
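The abstract is truncated just as it names the coding method, so the exact technique is unknown; one common data-driven, statistical approach from this era is principal-component coding of mouth-region images ("eigen-images"). The sketch below is only an illustration of that general idea, with random arrays standing in for real video frames:

```python
import numpy as np

# Hypothetical data: 200 grayscale mouth-region images, 32x32 pixels,
# flattened to 1024-dimensional vectors (random values stand in for frames).
rng = np.random.default_rng(0)
frames = rng.random((200, 1024))

# Data-driven compression: centre the data and keep the top-k principal
# components, so each frame is coded by k coefficients instead of 1024 pixels.
mean = frames.mean(axis=0)
centred = frames - mean
_, _, vt = np.linalg.svd(centred, full_matrices=False)
k = 16
basis = vt[:k]                      # top-k principal directions

codes = centred @ basis.T           # (200, 16) compact visual features
reconstructed = codes @ basis + mean

compression_ratio = frames.shape[1] / k
print(compression_ratio)            # 64.0
```

Each frame is then represented by 16 coefficients, which is a tractable input size for a recogniser.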

Cited by 10 publications (7 citation statements)
References 11 publications
“…This was improved by Lan et al. [4], where short-term temporal information was included in the feature vector. Another possibility is to use geometric information, for example mouth width, area within the inner lip, etc. [3]. However, this requires accurate tracking of both the inner and outer lip shape, a non-trivial task.…”
Section: Introduction
confidence: 99%
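The geometric features this statement mentions (mouth width, area within the inner lip) are simple to derive once a tracker supplies ordered lip-contour points. A minimal sketch, assuming such points are available; the contour values here are made up so the numbers are checkable, and the area uses the shoelace formula:

```python
import numpy as np

def polygon_area(points):
    """Shoelace formula for the area enclosed by an ordered point list."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Hypothetical inner-lip contour (ordered corner points from a tracker);
# a 2x1 rectangle here purely for illustration.
inner_lip = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 1.0], [0.0, 1.0]])

mouth_width = inner_lip[:, 0].max() - inner_lip[:, 0].min()   # 2.0
inner_area = polygon_area(inner_lip)                          # 2.0
feature = np.array([mouth_width, inner_area])
```

The difficulty the statement points to lies not in these formulas but in obtaining the contour points reliably in the first place.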
“…Geometric features, appearance features, and combined features are commonly used for representing visual information. Geometry-based representations include fiducial points such as facial animation parameters [9], lip contours [10,11,14], the shape of the jaw and cheek [10,11], and mouth width, mouth opening, oral cavity area, and oral cavity perimeter [15]. These methods require accurate and reliable detection and tracking of facial and lip features, which is difficult to achieve in practice and can be impossible at low image resolution.…”
Section: Introduction
confidence: 99%
“…In some studies, lipreading combined with face and voice has been investigated to aid biometric identification [1][2][3]. There is also much work on audio-visual speech recognition (AVSR) [4][5][6][7][8][9][10][11][12][13][14][15][16], seeking effective ways of combining visual information with existing audio-only speech recognition (ASR) systems. The McGurk effect [17] demonstrates that inconsistency between audio and visual information can result in perceptual confusion.…”
Section: Introduction
confidence: 99%
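One common way of combining visual information with an audio-only recogniser, as this statement describes, is late fusion of per-stream scores. A minimal sketch, assuming each stream produces per-hypothesis log-likelihoods and using a hand-picked stream weight (the scores and weight are illustrative, not from any cited system):

```python
import numpy as np

def fuse(audio_loglik, visual_loglik, lam=0.5):
    """Weighted late fusion: combine per-hypothesis log-likelihoods from
    the audio and visual streams; lam weights the audio stream."""
    return lam * audio_loglik + (1.0 - lam) * visual_loglik

# Hypothetical scores for three word hypotheses.
audio = np.array([-4.0, -2.5, -6.0])    # noisy audio favours hypothesis 1
visual = np.array([-1.0, -5.0, -4.0])   # lip evidence favours hypothesis 0

combined = fuse(audio, visual, lam=0.5)
best = int(np.argmax(combined))         # hypothesis 0 wins after fusion
```

In practice the weight is often adapted to the estimated acoustic noise level, shifting trust toward the visual stream as audio quality degrades.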
“…In contrast, automatic lipreading has been minimally investigated, with the exception of a few pioneering efforts [35,36,37]. Today, interest in automatic recognition of audio-visual speech is emerging, due to several factors.…”
Section: Technological Issues
confidence: 99%
“…As discussed by Brooke [37], there are two main areas of research in the design of audio-visual speech recognizers: extraction of optical characteristics and integration of information, both addressing basic questions relevant to the human ability to process speech bimodally. One approach to extracting optical cues uses techniques such as lip contour detection, by analysis of lip luminance and/or chrominance, associated with the fitting of a lip model or of deformable templates.…”
Section: Technological Issues
confidence: 99%
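The chrominance analysis mentioned above can be sketched as a simple red/green ratio threshold, exploiting the fact that lip pixels tend to be redder than the surrounding skin. Everything here (the patch, the colours, and the threshold) is an illustrative assumption, not the method of [37]:

```python
import numpy as np

# Hypothetical 4x4 RGB patch with a 2x2 "lip" region surrounded by "skin".
patch = np.zeros((4, 4, 3))
patch[1:3, 1:3] = [180.0, 60.0, 70.0]             # lip pixels: strong red
patch[patch.sum(axis=2) == 0] = [200.0, 160.0, 140.0]  # remaining skin pixels

r, g = patch[..., 0], patch[..., 1]
chrominance = r / (g + 1e-6)      # pseudo-hue: red relative to green
lip_mask = chrominance > 2.0      # hand-chosen threshold for this example

print(int(lip_mask.sum()))        # 4 lip pixels detected
```

A real system would follow such a mask with contour fitting (e.g. a deformable template), since raw thresholding is sensitive to lighting and skin tone.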