1999
DOI: 10.1109/89.799688

Comparing models for audiovisual fusion in a noisy-vowel recognition task

Abstract: Audiovisual speech recognition involves fusion of the audio and video sensors for phonetic identification. There are three basic ways to fuse data streams for taking a decision such as phoneme identification: data-to-decision, decision-to-decision, and data-to-data. This leads to four possible models for audiovisual speech recognition, that is, direct identification in the first case, separate identification in the second one, and two variants of the third early integration case, namely dominant recoding or mot…

Cited by 74 publications (37 citation statements)
References 40 publications
“…In contrast, shape based feature extraction assumes that most speechreading information is contained in the contours of the speaker's lips, or more generally in the face contours, e.g., jaw and cheek shape, in addition to the lips [48]. Within this category belong geometric type features, such as mouth height, width, and area [19], [22], [26], [28], [29], [32]–[35], [49]–[52], Fourier and image moment descriptors of the lip contours [28], [53], statistical models of shape, such as active shape models [48], [54], or other parameters of lip-tracking models [44], [55]–[57]. Finally, features from both categories can be concatenated into a joint shape and appearance vector [27], [44], [58], [59], or a joint statistical model can be learned on such vectors, as is the case of the active appearance model [60], used for speechreading in [48].…”
Section: The Visual Front End
confidence: 99%
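The geometric features named in the excerpt above (mouth height, width, and area) can be sketched in a few lines. This is an illustrative toy, not code from any cited paper: it assumes a hypothetical lip contour given as an (N, 2) array of image-coordinate points and computes the enclosed area with the shoelace formula.

```python
import numpy as np

def geometric_lip_features(contour):
    """Illustrative geometric lip features from an (N, 2) array of
    contour points in image coordinates (hypothetical input format)."""
    pts = np.asarray(contour, dtype=float)
    width = pts[:, 0].max() - pts[:, 0].min()   # horizontal mouth extent
    height = pts[:, 1].max() - pts[:, 1].min()  # vertical mouth extent
    # Shoelace formula: area of the polygon enclosed by the contour
    x, y = pts[:, 0], pts[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    return width, height, area

# A square "mouth" of side 2: width 2, height 2, area 4
w, h, a = geometric_lip_features([(0, 0), (2, 0), (2, 2), (0, 2)])
```

In practice such scalars would be extracted per video frame by a lip tracker and stacked into the visual feature vector that the recognizer consumes.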
“…Various information fusion algorithms have been considered for AV-ASR, differing both in their basic design, as well as in the terminology used [18], [22], [27], [31], [36], [51], [91]–[94]. In this paper, we adopt their broad grouping into feature fusion and decision fusion methods.…”
Section: Audio-Visual Integration for ASR
confidence: 99%
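The feature-fusion versus decision-fusion grouping mentioned in the excerpt above can be sketched minimally. This is a generic illustration, not any cited paper's method: feature fusion concatenates modality features before a single classifier, while decision fusion combines per-modality class posteriors afterwards, here via a weighted product rule in the log domain (the weight and the posterior values are made-up examples).

```python
import numpy as np

def feature_fusion(audio_feat, visual_feat):
    """Feature (early) fusion: concatenate per-modality feature
    vectors so a single classifier sees one joint observation."""
    return np.concatenate([audio_feat, visual_feat])

def decision_fusion(audio_post, visual_post, audio_weight=0.5):
    """Decision (late) fusion: combine per-modality class posteriors
    as a weighted geometric mean (product rule in the log domain)."""
    log_p = (audio_weight * np.log(audio_post)
             + (1.0 - audio_weight) * np.log(visual_post))
    p = np.exp(log_p)
    return p / p.sum()  # renormalize to a proper distribution

audio = np.array([0.7, 0.2, 0.1])   # hypothetical audio posteriors
visual = np.array([0.4, 0.5, 0.1])  # hypothetical visual posteriors
fused = decision_fusion(audio, visual, audio_weight=0.7)
```

The stream weight is where decision fusion earns its keep in noisy-speech settings: shifting it toward the video stream as acoustic SNR drops lets the reliable modality dominate the decision.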
“…It is noted that, to date, this is the largest audio-visual database collected, and the only one suitable for continuous, large-vocabulary, speaker-independent audio-visual speech recognition, as all other existing databases [7]–[15] are limited to a small number of subjects and/or small-vocabulary tasks. The CUAVE database project was initiated at Clemson University for audio-visual experiments.…”
Section: Related Work
confidence: 99%
“…These approaches belong to three main groups, depending on how the extracted data are analyzed: appearance-based methods, including statistical methods (Adjoudani & Benoît, 1996; Bregler & Konig, 1994; Erber et al., 1979); shape-based or geometric methods (Chiou & Hwang, 1997; Rogozan & Deléglise, 1998; Teissier et al., 1999); and combinations of appearance-based and shape-based methods (Cootes et al., 1998). The need for a globally defined visual phoneme concept suggests expressing the visual information through mathematical models.…”
Section: Introduction
confidence: 99%