2003
DOI: 10.1109/jproc.2003.817150

Recent advances in the automatic recognition of audiovisual speech

Abstract: Visual speech information from the speaker's mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability into the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently…

Cited by 594 publications (466 citation statements)
References 120 publications
“…Typically, this is done by applying a transformation such as the discrete cosine transform (DCT) and/or a dimensionality reduction technique such as the linear discriminant analysis (LDA) to the ROI, possibly in combination with a principal component analysis (PCA) or a maximum-likelihood linear transform (MLLT) [23]. A common feature post-processing technique involves a chain of LDAs and MLLTs on concatenated frames, the so-called HiLDA [22].…”
Section: Related Work (mentioning)
confidence: 99%
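
The DCT step named in this statement can be sketched as follows. This is a minimal illustration, not the front end of the cited paper: it assumes a small grayscale ROI given as a list of rows, uses an orthonormal DCT-II, and keeps only the top-left block of low-frequency coefficients as the feature vector. The function names and the `keep` parameter are hypothetical, and the later LDA, PCA, or MLLT stages are left out.

```python
import math


def dct_1d(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        # Scaling that makes the transform orthonormal (energy preserving).
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(image):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]


def low_frequency_features(roi, keep=3):
    """Hypothetical helper: flatten the top-left keep x keep DCT block.

    Assumes the ROI has at least `keep` rows and `keep` columns.
    """
    coeffs = dct_2d(roi)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


if __name__ == "__main__":
    # Toy 4x4 "mouth region" with a horizontal intensity ramp (made-up data).
    roi = [[float(c) for c in range(4)] for _ in range(4)]
    print(low_frequency_features(roi, keep=2))
```

Keeping a fixed low-frequency block is only one way to select coefficients. In a full system the resulting vectors, often concatenated across neighbouring frames, would then go through a dimensionality reduction stage such as the LDA and MLLT chain the statement describes.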
“…Lip visual features are generally grouped into three categories [31][32][33]: (a) appearance-based features; (b) shape-based features and (c) a combination of both appearance and shape features.…”
Section: Lip Visual Features (mentioning)
confidence: 99%
“…Since only a small part of the vocal tract is visible when we speak, only partial physical information is available regarding the generation of visemes and not all can be mapped to a unique phoneme [32], the basic unit of speech in the audio domain.…”
Section: Speech Classification Based On Lip Features (mentioning)
confidence: 99%
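
The many-to-one relation between phonemes and visemes described in this statement can be made concrete with a short sketch. The table below is a deliberately tiny, hypothetical mapping chosen for illustration (the viseme labels are assumptions, not a standard inventory). It relies only on the well-known fact that the bilabials /p/, /b/, /m/ look alike on the lips, as do the labiodentals /f/, /v/.

```python
from collections import defaultdict

# Hypothetical, illustrative phoneme-to-viseme table (not a standard set).
PHONEME_TO_VISEME = {
    "p": "bilabial",
    "b": "bilabial",
    "m": "bilabial",
    "f": "labiodental",
    "v": "labiodental",
}


def invert(mapping):
    """Group phonemes by the viseme they share."""
    groups = defaultdict(list)
    for phoneme, viseme in mapping.items():
        groups[viseme].append(phoneme)
    return dict(groups)


def is_ambiguous(viseme, mapping=PHONEME_TO_VISEME):
    """True when more than one phoneme maps to the given viseme."""
    return len(invert(mapping).get(viseme, [])) > 1


if __name__ == "__main__":
    for viseme, phonemes in invert(PHONEME_TO_VISEME).items():
        print(viseme, "->", phonemes)
```

Because the inverse mapping returns several candidate phonemes per viseme, a visual-only recognizer cannot recover the phoneme from lip shape alone, which is the ambiguity the statement points to.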
“…Lan et al. [8] achieve an accuracy of 45% on their challenging 12-speaker audio-visual corpus. A good overview of the field is given in [14] and [15]. Neti et al. [16] present audio-visual but also visual-only recognition results.…”
Section: State Of the Art (mentioning)
confidence: 99%