Proceedings of the 10th International Conference on Multimodal Interfaces 2008
DOI: 10.1145/1452392.1452438

Detection and localization of 3D audio-visual objects using unsupervised clustering

Abstract: This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This mode…

Cited by 6 publications (7 citation statements)
References 24 publications
“…In the case of one camera and one microphone, spatial alignment is not possible and methods using this minimal sensor configuration work well only if it is assumed a perfect temporal alignment between the image sequence and the one-dimensional acoustic signal [14]. However, methods using just one camera do not permit to take full advantage of three-dimensional audio-visual event localization which has been proved to be very useful for the detection and localization of multiple speakers [1], [5] or for sound-source separation [15]. Moreover, as already explained, we note that the temporal alignment assumption is not at all realistic.…”
Section: A Related Work
mentioning
confidence: 99%
“…INTRODUCTION Audiovisual (AV) scene analysis has become an increasingly popular research topic during the past years due to many useful applications: human-robot interaction [1], multimodal interfaces [2], audio-visual tracking [3], [4], object localization [5], etc. Various attempts to build computational paradigms for AV scene analysis consider the issue of integration as the cornerstone of the approaches.…”
mentioning
confidence: 99%
“…The task of simultaneous detection and 3D localization using multimodal data has also been addressed in [20,21]. The authors propose a probabilistic framework based on a conjugate GMM.…”
Section: Related Work
mentioning
confidence: 99%
“…In this section, we present some useful tools, and illustrate some low level A/V cues that may be used to exploit the data. For an example of an AV localisation algorithm tested on the CAVA database, we refer to [14].…”
Section: Tools For Data Exploitation
mentioning
confidence: 99%