Audio–Visual Sound Source Localization and Tracking Based on Mobile Robot for The Cocktail Party Problem

Shi, Zhenming; Zhang, Lin; Wang, Dongqing

doi:10.3390/app13106056

Cited by 6 publications

(2 citation statements)

References 43 publications


“…Another line focuses on finding a better visual appearance model to track multiple speakers in indoor environments [34]. At the same time, other proposals are centered on audiovisual tracking in compact configurations (co-located camera and microphone array) for applications such as human-robot interaction [17, 20, 21, 35].…”

Section: Previous Work (mentioning)

confidence: 99%

Audiovisual Tracking of Multiple Speakers in Smart Spaces

Sanabria-Macias, Marron-Romera, Macias-Guarasa

2023

Sensors


This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to 50.3% average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to 69.7% average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to 18.1% average relative improvement in the MOT task for the CAV3D dataset (3D comparison).


“…The latency of generating DOA estimates will limit how quickly a humanoid robot can respond to movement of a current talker or orient towards a new talker. Works such as [5,6] consider accurate DOA estimation on robotic systems, but also require a consideration for latency and turn-taking in the context of human-robot conversational scenarios.…”

Section: Introduction (mentioning)

confidence: 99%

Estimating speaker direction on a humanoid robot with binaural acoustic signals

Barot, Mombaur, MacDonald

2024

PLoS ONE


To achieve human-like behaviour during speech interactions, it is necessary for a humanoid robot to estimate the location of a human talker. Here, we present a method to optimize the parameters used for the direction of arrival (DOA) estimation, while also considering real-time applications for human-robot interaction scenarios. This method is applied to a binaural sound source localization framework on a humanoid robotic head. Real data is collected and annotated for this work. Optimizations are performed via a brute force method and a Bayesian model-based method; results are validated and discussed, and effects on latency for real-time use are also explored.
