Proceedings of the 13th International Conference on Multimodal Interfaces 2011
DOI: 10.1145/2070481.2070527

Finding audio-visual events in informal social gatherings

Abstract: In this paper we address the problem of detecting and localizing objects that can be both seen and heard, e.g., people. This may be solved within the framework of data clustering. We propose a new multimodal clustering algorithm based on a Gaussian mixture model, where one of the modalities (visual data) is used to supervise the clustering process. This is made possible by mapping both modalities into the same metric space. To this end, we fully exploit the geometric and physical properties of an audio-visual …

Cited by 19 publications (22 citation statements). References 23 publications.
“…That is, the number of speakers as well as their positions and their speaking state. In order to reach this goal, we adopted the framework proposed in [2]. Based on a multimodal Gaussian mixture model (mGMM), this method is able to detect and localize audiovisual events from auditory and visual observations.…”
Section: An Audio-visual Fusion Model
confidence: 99%
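The multimodal Gaussian mixture model (mGMM) cited above is not specified in this excerpt. As a minimal sketch, assuming both modalities have already been mapped into a shared 1D space (as the framework's mapping step provides), a plain EM fit of a one-dimensional Gaussian mixture over the pooled audio-visual observations could look like this (the function name and quantile-based initialization are illustrative choices, not the paper's):

```python
import numpy as np

def fit_gmm_1d(x, n_components, n_iter=100):
    """Fit a 1D Gaussian mixture by EM (minimal illustrative sketch).

    x : pooled 1D audio-visual observations, shape (N,).
    Returns (weights, means, variances), one entry per component.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Initialise means at evenly spaced quantiles of the data,
    # with a shared variance and uniform weights.
    q = (np.arange(n_components) + 1.0) / (n_components + 1.0)
    means = np.quantile(x, q)
    variances = np.full(n_components, np.var(x))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        diff = x[:, None] - means[None, :]
        log_pdf = -0.5 * (np.log(2.0 * np.pi * variances) + diff**2 / variances)
        log_resp = np.log(weights) + log_pdf
        log_resp -= log_resp.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means)**2).sum(axis=0) / nk
    return weights, means, variances

# Usage: two well-separated clusters are recovered from pooled observations.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.3, 300), rng.normal(3.0, 0.3, 300)])
weights, means, variances = fit_gmm_1d(x, n_components=2)
print(np.sort(means))  # two modes, one near -2 and one near 3
```

In the cited framework, each recovered component would correspond to one audio-visual event (e.g., a speaker), with its mean giving the event's position in the shared space.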
“…Equation (1) maps 3D points onto the 1D space of ITD observations. The key aspect of our generative audio-visual model [8], [2] is that (1) can be used to map 3D points (visual features) onto the ITD space associated with two microphones, on the premise that the cameras are aligned with the microphones [9]. Hence the fusion between binaural observations and binocular observations is achieved in 1-D.…”
Section: An Audio-visual Fusion Model
confidence: 99%
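Equation (1) itself is not reproduced in this excerpt. A standard form of the 3D-to-ITD mapping it describes is the difference in propagation time from a source point to each of the two microphones. A minimal sketch under that assumption (microphone positions and the helper name are hypothetical):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def itd(point, mic_left, mic_right, c=SPEED_OF_SOUND):
    """Map a 3D point to the 1D ITD space of a microphone pair.

    ITD = (||x - m_left|| - ||x - m_right||) / c, i.e. the difference
    in travel time from the source to the two microphones (seconds).
    """
    point = np.asarray(point, dtype=float)
    d_left = np.linalg.norm(point - np.asarray(mic_left, dtype=float))
    d_right = np.linalg.norm(point - np.asarray(mic_right, dtype=float))
    return (d_left - d_right) / c

# A source equidistant from both microphones yields a zero ITD.
print(itd([0.0, 0.0, 2.0], [-0.1, 0.0, 0.0], [0.1, 0.0, 0.0]))  # → 0.0
```

This is how visual 3D features can be projected into the same 1D space as the binaural observations: each candidate 3D point (e.g., from binocular reconstruction) gets a predicted ITD, so audio and visual data can be clustered jointly in one dimension.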