2021
DOI: 10.1109/tpami.2019.2953020

Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

Abstract: In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We propose to exploit the complementary nature of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status (either speaking or silent) of each tracked person over time. We propose to cast the problem at hand into a generative…

Cited by 43 publications (49 citation statements)
References 38 publications
“…A diarization system using only face identification and SSL may be regarded as a baseline, as this approach was widely used in previous audio-visual diarization studies [34][35][36]. (Note that SA-WER used here is different from SWER of [2].) The results show that the use of speaker identification substantially improved the speaker attribution accuracy.…”
Section: Results (mentioning)
confidence: 99%
“…[33,34] do not cope with speech overlaps. While the methods proposed in [35,36] address the overlap issue, they rely solely on spatial cues and thus are not applicable when multiple speakers sit side by side.…”
Section: Introduction (mentioning)
confidence: 99%
“…The present work is also related to previous work in localizing sounds in visual inputs [20,14,22,9,8,24,4,35], which aims to identify which pixels in a video are associated with an object making a particular sound.…”
Section: Sound Localization (mentioning)
confidence: 97%
“…In this section we derive the expression for τ_j by computing the integral (17). Using the probabilistic model defined, we can write (the index j is omitted): We will first marginalize s_{t−L}.…”
Section: Appendix D: Derivation of the Birth Probability (mentioning)
confidence: 99%