Detection and localization of 3d audio-visual objects using unsupervised clustering

Khalidov, Vasil; Forbes, Florence; Hansard, Miles; Arnaud, Élise; Horaud, Radu

doi:10.1145/1452392.1452438

Cited by 6 publications

(7 citation statements)

References 24 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In the case of one camera and one microphone, spatial alignment is not possible and methods using this minimal sensor configuration work well only if it is assumed a perfect temporal alignment between the image sequence and the one-dimensional acoustic signal [14]. However, methods using just one camera do not permit to take full advantage of three-dimensional audio-visual event localization which has been proved to be very useful for the detection and localization of multiple speakers [1], [5] or for sound-source separation [15]. Moreover, as already explained, we note that the temporal alignment assumption is not at all realistic.…”

Section: A Related Workmentioning

confidence: 99%

“…INTRODUCTION Audiovisual (AV) scene analysis has become an increasingly popular research topic during the past years due to many useful applications: human-robot interaction [1], multimodal interfaces [2], audio-visual tracking [3], [4], object localization [5], etc. Various attempts to build computational paradigms for AV scene analysis consider the issue of integration as the cornerstone of the approaches.…”

mentioning

confidence: 99%

See 1 more Smart Citation

Alignment of binocular-binaural data using a moving audio-visual target

Khalidov

Forbes

Horaud

2013

2013 IEEE 15th International Workshop on Multimedia Signal Processing (MMSP)

Self Cite

View full text Add to dashboard Cite

Abstract-In this paper we address the problem of aligning visual (V) and auditory (A) data using a sensor that is composed of a camera-pair and a microphone-pair. The original contribution of the paper is a method for AV data aligning through estimation of the 3D positions of the microphones in the visualcentred coordinate frame defined by the stereo camera-pair. We exploit the fact that these two distinct data sets are conditioned by a common set of parameters, namely the (unknown) 3D trajectory of an AV object, and derive an EM-like algorithm that alternates between the estimation of the microphone-pair position and the estimation of the AV object trajectory. The proposed algorithm has a number of built-in features: it can deal with A and V observations that are misaligned in time, it estimates the reliability of the data, it is robust to outliers in both modalities, and it has proven theoretical convergence. We report experiments with both simulated and real data.

show abstract

Section: A Related Workmentioning

confidence: 99%

mentioning

confidence: 99%

Alignment of binocular-binaural data using a moving audio-visual target

Khalidov

Forbes

Horaud

2013

2013 IEEE 15th International Workshop on Multimedia Signal Processing (MMSP)

Self Cite

View full text Add to dashboard Cite

show abstract

“…The task of simultaneous detection and 3D localization using multimodal data has also been addressed in [20,21]. The authors propose a probabilistic framework based on a conjugate GMM.…”

Section: Related Workmentioning

confidence: 99%

Finding audio-visual events in informal social gatherings

Alameda-Pineda

Khalidov

Horaud

et al. 2011

Proceedings of the 13th International Conference on Multimodal Interfaces

Self Cite

View full text Add to dashboard Cite

In this paper we address the problem of detecting and localizing objects that can be both seen and heard, e.g., people. This may be solved within the framework of data clustering. We propose a new multimodal clustering algorithm based on a Gaussian mixture model, where one of the modalities (visual data) is used to supervise the clustering process. This is made possible by mapping both modalities into the same metric space. To this end, we fully exploit the geometric and physical properties of an audio-visual sensor based on binocular vision and binaural hearing. We propose an EM algorithm that is theoretically well justified, intuitive, and extremely efficient from a computational point of view. This efficiency makes the method implementable on advanced platforms such as humanoid robots. We describe in detail tests and experiments performed with publicly available data sets that yield very interesting results.

show abstract

“…In this section, we present some useful tools, and illustrate some low level A/V cues that may be used to exploit the data. For an example of an AV localisation algorithm tested on the CAVA database, we refer to [14].…”

Section: Tools For Data Exploitationmentioning

confidence: 99%

The CAVA corpus

Arnaud

Christensen

et al. 2008

Proceedings of the 10th International Conference on Multimodal Interfaces

Self Cite

View full text Add to dashboard Cite

This paper describes the acquisition and content of a new multi-modal database. Some tools for making use of the data streams are also presented. The Computational AudioVisual Analysis (CAVA) database is a unique collection of three synchronised data streams obtained from a binaural microphone pair, a stereoscopic camera pair and a head tracking device. All recordings are made from the perspective of a person; i.e. what would a human with natural head movements see and hear in a given environment. The database is intended to facilitate research into humans' ability to optimise their multi-modal sensory input and fills a gap by providing data that enables human centred audiovisual scene analysis. It also enables 3D localisation using either audio, visual, or audiovisual cues. A total of 50 sessions, with varying degrees of visual and auditory complexity, were recorded. These range from seeing and hearing a single speaker moving in and out of field of view, to moving around a 'cocktail party' style situation, mingling and joining different small groups of people chatting.

show abstract

Detection and localization of 3d audio-visual objects using unsupervised clustering

Cited by 6 publications

References 24 publications

Alignment of binocular-binaural data using a moving audio-visual target

Alignment of binocular-binaural data using a moving audio-visual target

Finding audio-visual events in informal social gatherings

The CAVA corpus

Contact Info

Product

Resources

About