2005
DOI: 10.1007/11608288_66

Multi-level Fusion of Audio and Visual Features for Speaker Identification

Abstract: This paper explores the fusion of audio and visual evidence through a multi-level hybrid fusion architecture based on a dynamic Bayesian network (DBN), which combines model-level and decision-level fusion to achieve higher performance. In model-level fusion, a new audiovisual correlative model (AVCM) based on DBN is proposed, which describes both the intercorrelations and the loose timing synchronicity between the audio and video streams. The experiments on the CMU database and our own homegrown database both demon…

Cited by 44 publications (39 citation statements)
References 9 publications
“…In addition, there is another method, which lies between early and late integration and is called intermediate integration (some sources classify it as early integration). It is also possible to combine two integration methods by performing fusion at two levels simultaneously, which is called the hybrid approach [52]. These methods are described in more detail below, with an analysis of their advantages and disadvantages.…”
Section: Methods of fusing audiovisual information (unclassified)
“…However, proper fusion of multiple biometrics remains a challenging and crucial task. Amongst the various multimodal biometric methods reported so far, Audio-Visual person authentication [3][4][5][6][7][8][17] offers some more unique advantages. First of all, it uses two biometrics (speech and face image), which people share quite comfortably in everyday life.…”
Section: Introduction (mentioning)
confidence: 99%
“…A few recent methods (e.g. [8]) proposed a feature level fusion as well. For the speech mode, various text-independent speaker recognition methods, using GMM [11] or VQ [10], are predominantly used.…”
Section: Introduction (mentioning)
confidence: 99%
“…It is generally agreed that the model level fusion gives better performance because it can capture the potentially useful coupling or conditional dependence between audio visual modalities [2][3][4][5]. However, in a very noisy environment, the performance of model level fusion may not be as good as that of decision level fusion [2,5].…”
Section: Introduction (mentioning)
confidence: 99%
“…It combines model level and decision level fusion to achieve improved performance [5]. In such a strategy, the fusion weights are of great importance as they must capture the reliability of inputs which may vary dynamically.…”
Section: Introduction (mentioning)
confidence: 99%
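
The citation statements above note that decision-level fusion depends on weights that reflect how reliable each input is. The following is a minimal sketch of that general idea only, not the method of the cited paper: it assumes each modality has already produced a per-speaker log-likelihood, and it combines them with a single hypothetical reliability weight `audio_weight` in [0, 1]. The function names, the speaker labels, and the scores are illustrative assumptions.

```python
def fuse_scores(audio_ll, visual_ll, audio_weight):
    """Weighted sum of per-speaker log-likelihoods (illustrative decision-level fusion).

    audio_ll, visual_ll: dicts mapping speaker label -> log-likelihood.
    audio_weight: assumed reliability of the audio stream, in [0, 1];
                  the visual stream receives 1 - audio_weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if audio_ll.keys() != visual_ll.keys():
        raise ValueError("both modalities must score the same speakers")
    visual_weight = 1.0 - audio_weight
    return {
        speaker: audio_weight * audio_ll[speaker] + visual_weight * visual_ll[speaker]
        for speaker in audio_ll
    }


def identify(audio_ll, visual_ll, audio_weight):
    """Return the speaker with the highest fused score."""
    fused = fuse_scores(audio_ll, visual_ll, audio_weight)
    return max(fused, key=fused.get)


# Made-up scores: audio favours "alice", video favours "bob".
audio_ll = {"alice": -10.0, "bob": -12.0}
visual_ll = {"alice": -15.0, "bob": -11.0}

clean_audio_choice = identify(audio_ll, visual_ll, audio_weight=0.8)
noisy_audio_choice = identify(audio_ll, visual_ll, audio_weight=0.2)
```

With these made-up scores, trusting the audio stream (weight 0.8) selects "alice", while down-weighting it (weight 0.2), as one might in a noisy acoustic environment, selects "bob". In practice the weight would have to be estimated from the input conditions, possibly varying over time, which is the difficulty the citing papers point to.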