2002
DOI: 10.1155/s1110865702207039

A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications

Abstract: Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi…
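As a rough illustration of the pipeline the abstract describes, the Python sketch below shows how per-viseme SVM scores could be mapped to posteriors with a sigmoid and combined by a left-to-right Viterbi pass over a word's viseme sequence. It is not the authors' code: the feature matrices, the trained classifiers, the sigmoid parameters A and B, and the stay probability are all assumed placeholders.

```python
# Hedged sketch: one SVM per viseme, sigmoidal score-to-posterior mapping,
# and a left-to-right Viterbi pass over a word's viseme sequence.
import numpy as np
from sklearn.svm import SVC

def sigmoid_posterior(score, A=-1.0, B=0.0):
    """Platt-style mapping of an SVM decision value to a posterior (A, B assumed)."""
    return 1.0 / (1.0 + np.exp(A * score + B))

def viseme_posteriors(frames, svms, A=-1.0, B=0.0):
    """frames: (T, D) lip-region features; svms: dict viseme -> trained SVC."""
    post = np.zeros((len(frames), len(svms)))
    for j, (viseme, clf) in enumerate(svms.items()):
        post[:, j] = sigmoid_posterior(clf.decision_function(frames), A, B)
    # normalise across visemes so each frame yields a probability vector
    return post / post.sum(axis=1, keepdims=True)

def word_log_likelihood(post, viseme_sequence, viseme_index, p_stay=0.6):
    """Left-to-right Viterbi over the word's viseme states (stay or advance)."""
    T, S = len(post), len(viseme_sequence)
    logp = np.log(post[:, [viseme_index[v] for v in viseme_sequence]] + 1e-12)
    delta = np.full((T, S), -np.inf)
    delta[0, 0] = logp[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s] + np.log(p_stay)
            move = delta[t - 1, s - 1] + np.log(1 - p_stay) if s > 0 else -np.inf
            delta[t, s] = max(stay, move) + logp[t, s]
    return delta[-1, -1]  # best path must end in the last viseme
```

Recognition would then score each word hypothesis with word_log_likelihood on its viseme sequence and pick the highest-scoring word.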

Cited by 35 publications (20 citation statements). References 27 publications (33 reference statements).

Citation statements:
“…Speaker diarization is the problem of determining "who spoke when" based on audio signals [Anguera Miro et al. 2012]. The problem of speech recognition based on visual information has been studied by Gordan et al. [2002] and Saenko et al. [2004; …]. Visual and audio data are often used together in speaker localization [Nock et al. 2003; Potamianos et al. 2003].…”
Section: Related Work (mentioning; confidence: 99%)
“…Moreover, by using a method common to both the audio and visual aspects of speech, there is the potential for a more straightforward combination of results obtained from separate audio and visual investigations, and such integration has often been carried out using machine learning techniques such as time-delay neural networks (TDNN) [42], support vector machines (SVM) [43], and AdaBoost [44].…”
Section: Speech Classification Based on Lip Features (mentioning; confidence: 99%)
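Purely as an illustration of the late-fusion idea this statement alludes to, and not code from any of the cited works, the sketch below trains separate audio and visual SVMs and combines their class posteriors with a weighted log-linear rule; the feature arrays, labels, and fusion weight are hypothetical.

```python
# Hedged sketch of late fusion: independent audio and visual classifiers,
# posteriors combined with an assumed weighting.
import numpy as np
from sklearn.svm import SVC

def train_stream(features, labels):
    """Train one SVM per modality; probability=True enables posterior estimates."""
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(features, labels)
    return clf

def fuse_posteriors(p_audio, p_visual, audio_weight=0.7):
    """Weighted log-linear combination of (N, C) per-class posterior arrays."""
    logp = audio_weight * np.log(p_audio + 1e-12) \
         + (1 - audio_weight) * np.log(p_visual + 1e-12)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

# Usage (shapes only): audio_clf.predict_proba(X_audio) and
# visual_clf.predict_proba(X_visual) both give (N, C) arrays for fuse_posteriors.
```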
“…We used the 450 TIMIT-SX sentences originally designed to provide a good coverage of phonetic contexts of the English language in as few words as possible [17]. Each speaker was asked to read 20 sentences.…”
Section: A.1.1 Linguistic Content (mentioning; confidence: 99%)
“…The acoustic models for the forced-path alignment process were seeded from models generated from the TIMIT corpus [17]. Because the noise level of the AV-TIMIT corpus was higher than that of TIMIT (which was recorded with a noise-canceling, close-talking microphone), the initial time-aligned transcriptions were not as accurate as we had desired (as determined by expert visual inspection against spectrograms).…”
Section: A.2.1 Audio Processing (mentioning; confidence: 99%)
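For readers unfamiliar with forced-path alignment, the sketch below shows the core idea in toy form: given per-frame log-likelihoods of each phone in a known transcription (here assumed to come from seed acoustic models such as the TIMIT-trained ones mentioned above), a left-to-right Viterbi pass assigns every frame to a phone, and noisier audio simply makes those log-likelihoods, and hence the recovered boundaries, less reliable. This is an assumption-laden illustration, not the actual AV-TIMIT processing.

```python
# Hedged illustration of forced-path alignment with a constrained Viterbi pass.
import numpy as np

def forced_align(loglik, p_stay=0.7):
    """loglik: (T, S) log-likelihood of frame t under the s-th phone of the
    known transcription. Returns a length-T array of phone indices."""
    T, S = loglik.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0] = loglik[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s] + np.log(p_stay)
            move = delta[t - 1, s - 1] + np.log(1 - p_stay) if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            delta[t, s] = max(stay, move) + loglik[t, s]
    # Traceback from the final phone: each frame gets a phone index, and the
    # points where the index advances are the estimated time boundaries.
    path = np.empty(T, dtype=int)
    path[-1] = S - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```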