2007
DOI: 10.1109/tc.2007.1074

Synergy of Lip-Motion and Acoustic Features in Biometric Speech and Speaker Recognition

Abstract: This paper presents the scheme and evaluation of a robust audio-visual digit-and-speaker-recognition system using lip motion and speech biometrics. Moreover, a liveness verification barrier based on a person's lip movement is added to the system to guard against advanced spoofing attempts such as replayed videos. The acoustic and visual features are integrated at the feature level and evaluated first by a Support Vector Machine for digit and speaker identification and, then, by a Gaussian Mixture Model…
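The abstract describes integrating acoustic and visual features at the feature level before classifier scoring. A minimal sketch of such fusion, assuming frame-aligned streams; the function name, dimensions, and per-stream z-normalization are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def fuse_features(acoustic, visual):
    """Feature-level fusion (hypothetical sketch): concatenate per-frame
    acoustic and visual feature vectors into one joint vector per frame."""
    acoustic = np.asarray(acoustic, dtype=float)
    visual = np.asarray(visual, dtype=float)
    if acoustic.shape[0] != visual.shape[0]:
        raise ValueError("streams must be frame-aligned")

    def znorm(x):
        # Normalize each stream so neither modality dominates the
        # downstream classifier's distance/kernel computations.
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

    return np.hstack([znorm(acoustic), znorm(visual)])
```

The fused frame vectors can then be fed to any frame-level classifier (the paper uses an SVM followed by a GMM stage).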


Cited by 29 publications (12 citation statements)
References 38 publications
Order By: Relevance
“…Furthermore, an extension of the proposed algorithm for the efficient detection of the active speaker(s) in a multi-person environment is introduced and tested in this letter. The main research topic in the area of speech analysis using visual information is automatic visual or audio-visual speech recognition [2]-[4]. However, only a few works [6], [7] address the same problem as in our work, i.e., characterizing the frames of a video sequence as containing speaking persons or not using only visual information.…”
Section: Introduction
confidence: 94%
“…From our modelling assumptions, x ~ N(0, I) under H0 and x ~ N(1, I) under H1, where 0 and 1 denote the all-zero and all-one vectors respectively and I denotes the identity matrix. By substituting these density functions in (2), the likelihood ratio in (4) results. We then compute the log-likelihood ratio by taking the logarithm of (4), and we incorporate the non-data terms (i.e., the terms of the sum that do not depend on the data) in the threshold. Thus, the following expression results:…”
Section: A Statistical Framework
confidence: 99%
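The statistical framework quoted above folds the data-independent terms of a log-likelihood ratio into the decision threshold. A toy sketch of that step, assuming (as the excerpt's wording suggests, but does not state in full) unit-variance Gaussians with all-zero and all-one mean vectors; for that model the LLR reduces to a sum of (x_i − 1/2):

```python
def log_likelihood_ratio(x):
    """LLR for H1: x ~ N(1, I) versus H0: x ~ N(0, I) (assumed model).
    log p1(x)/p0(x) = -0.5*sum((x_i-1)^2) + 0.5*sum(x_i^2)
                    = sum(x_i - 0.5); constants fold into the threshold."""
    return sum(xi - 0.5 for xi in x)

def decide(x, threshold=0.0):
    # Declare H1 (e.g., "speaking") when the LLR exceeds the threshold.
    return log_likelihood_ratio(x) > threshold
```

With the constants absorbed, the detector is a simple comparison of a data sum against a tuned threshold, which is what makes the per-frame test cheap.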
“…In the visual speech domain, existing publicly available digital databases such as XM2VTS [8] and MVGL [14], have been widely utilized for multi-modal speech recognition and speaker identification. However, these databases are more or less incompetent for the evaluation of the lip-password based speaker verification problem.…”
Section: Results
confidence: 99%
“…Lip-reading systems that use pixel-based features assume that the pixel values around the mouth area contain salient speech information [3]. Other lip-reading methods using active contours to encode lip-shapes [5] and constrained lip-motion statistics [6] have also been reported in the literature. This paper proposes a lip-reading technique using spatio-temporal templates (STT).…”
Section: Introduction
confidence: 99%