Multiple Speaker Tracking in Spatial Audio via PHD Filtering and Depth-Audio Fusion

Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple concurrent speakers with a de-emphasized acoustic map assisted by the image detection-derived 3D video observations. The 3D multimodal observations are either assigned to existing tracks for discriminative likelihood computation or used to initialize new tracks. The generative likelihoods rely on the color distribution of the target and the de-emphasized acoustic map. Experiments on the AV16.3 and CAV3D datasets show that the proposed tracker outperforms the uni-modal trackers and the state-of-the-art approaches both in 3D and on the image plane.

Section: Audio Observation - Single Speaker (mentioning)

confidence: 99%

Audio-Visual Tracking of Concurrent Speakers

Qian, Brutti, Lanz, et al. 2022

“…This problem has been solved with accurate calibration and rectification. Various inexpensive off-the-shelf 360 cameras with two fish-eye lenses have recently become popular [3,4,5].…”

Section: A Approximated Room Geometry Reconstruction (mentioning)

confidence: 99%

“…Audio and image processing have been investigated as separate research areas, typically ignoring their synergy when they work together. Recently, some works have been proposed to exploit their multimodal information, for applications such as speaker tracking [4], speech recognition [5], and event detection [6]. In this paper, we apply computer vision techniques to support audio reproduction adapted to the acoustics of a specific location.…”

Section: Introduction (mentioning)

confidence: 99%

Acoustic Room Modelling Using 360 Stereo Cameras

Kim, Remaggi, Fowler, et al. 2021

Self Cite

In this paper we propose a pipeline for estimating acoustic 3D room structure with geometry and attribute prediction using spherical 360° cameras. Instead of setting microphone arrays with loudspeakers to measure acoustic parameters for specific rooms, a simple and practical single-shot capture of the scene using a stereo pair of 360 cameras can be used to simulate those acoustic parameters. We assume that the room and objects can be represented as cuboids aligned to the main axes of the room coordinate (Manhattan world). The scene is captured as a stereo pair using off-the-shelf consumer spherical 360 cameras. A cuboid-based 3D room geometry model is estimated by correspondence matching between captured images and semantic labelling using a convolutional neural network (SegNet). The estimated geometry is used to produce frequency-dependent acoustic predictions of the scene. This is, to our knowledge, the first attempt in the literature to use visual geometry estimation and object classification algorithms to predict acoustic properties. Results are compared to measurements through calculated reverberant spatial audio object parameters used for reverberation reproduction customized to the given loudspeaker set up.

“…mainly exploited by the signal processing community. GM-PHD and Sequential Monte Carlo (SMC)-PHD filters are two commonly used implementations in this theory, as they have been able to generate convincing tracking performance in video-based multi-target tracking [2], [3], [5], [7], [15]-[17]. This is attributed to the advantages of PHD filtering methods, as they have the ability to deal with varying number of targets, and also provide the estimates in both cardinality and localization with relatively low computational cost [2].…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Level Cooperative Fusion of GM-PHD Filters for Online Multiple Human Tracking

Angelini, Chambers, et al. 2019

In this paper, we propose a multi-level cooperative fusion approach to address the online multiple human tracking problem in a Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter framework. The proposed fusion approach consists essentially of three steps. Firstly, we integrate two human detectors with different characteristics (full-body and body-parts), and investigate their complementary benefits for tracking multiple targets. For each detector domain, we then propose a novel Discriminative Correlation Matching (DCM) model, and fuse it with spatio-temporal information to address ambiguous identity association in the GM-PHD filter. Finally, we develop a robust fusion center with virtual and real zones to make a global decision based on preliminary candidate targets generated by each detector. This center also mitigates the sensitivity to missed detections in the Generalized Covariance Intersection (GCI) fusion process, thereby improving the fusion performance and tracking consistency. Experiments on the MOTChallenge Benchmark demonstrate the proposed method achieves improved performance over other state-of-the-art RFS-based tracking methods.
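
The citation statement above notes that PHD filters provide estimates in both cardinality and localization. A minimal sketch of how that can be read off a Gaussian-mixture intensity is given below. It is an illustration only, not the method of any paper listed here: the `GaussianComponent` type, the 2D position state, and the 0.5 extraction threshold are all assumptions made for the example. The idea it shows is standard for GM-PHD filters: the component weights sum to the expected number of targets, and the means of sufficiently heavy components serve as location estimates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GaussianComponent:
    """One term of a Gaussian-mixture PHD intensity (hypothetical layout)."""
    weight: float                 # expected number of targets this term accounts for
    mean: Tuple[float, float]     # 2D position estimate; covariance omitted for brevity


def expected_cardinality(components: List[GaussianComponent]) -> float:
    """Integral of the intensity, which for a Gaussian mixture is the sum of weights."""
    return sum(c.weight for c in components)


def extract_states(
    components: List[GaussianComponent], threshold: float = 0.5
) -> List[Tuple[float, float]]:
    """Report the means of components whose weight exceeds the (assumed) threshold."""
    return [c.mean for c in components if c.weight > threshold]


if __name__ == "__main__":
    mixture = [
        GaussianComponent(0.9, (1.0, 2.0)),
        GaussianComponent(0.8, (4.0, 0.5)),
        GaussianComponent(0.1, (7.0, 7.0)),  # low weight: likely clutter
    ]
    print(expected_cardinality(mixture))
    print(extract_states(mixture))
```

Because the target count comes from the weights rather than from a fixed list of tracks, the same representation handles targets appearing and disappearing without changing the data structure.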