Online multimodal speaker detection for humanoid robots

Sanchez-Riera, Jordi; Alameda-Pineda, Xavier; Wienke, Johannes; Deleforge, Antoine; Arias, Soraya; Čech, Jan; Wrede, Sebastian; Horaud, Radu

doi:10.1109/humanoids.2012.6651509

Cited by 10 publications

(13 citation statements)

References 17 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Similarly to the approach in [16], the proposed audiovisual fusion model relies on 3D visual features. Since the task is to localize speakers and to estimate their speaking activity status, ideally one would like to find 3D lips/mouth locations and to combine these locations with 2D sound source locations.…”

Section: B Audio-visual Associationmentioning

confidence: 99%

“…If this is the case, then the covariance matrices would be speaker/direction specific and should be estimated online during a short period of time during which several directions are collected. Then the GMM's parameters would be estimated via an EM procedure, e.g., by extending the model [16] from 1D to 2D sound localization. Such an approach would have required enough independent samples.…”

Section: B Audio-visual Associationmentioning

confidence: 99%

“…The algorithm outputs the posterior probability of the speaking state over the 3D-localized candidate speakers. While this association is similar in spirit to the fusion model in [16], the novel method doesn't need an EM procedure at runtime and yields accurate and discriminative two-dimensional audiovisual localization and detection.…”

Section: Introductionmentioning

confidence: 98%

See 2 more Smart Citations

Active-speaker detection and localization with microphones and cameras embedded into a robotic head

Čech

Mittal

Deleforge

et al. 2013

2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids)

Self Cite

View full text Add to dashboard Cite

Abstract-In this paper we present a method for detecting and localizing an active speaker, i.e., a speaker that emits a sound, through the fusion between visual reconstruction with a stereoscopic camera pair and sound-source localization with several microphones. Both the cameras and the microphones are embedded into the head of a humanoid robot. The proposed statistical fusion model associates 3D faces of potential speakers with 2D sound directions. The paper has two contributions: (i) a method that discretizes the two-dimensional space of all possible sound directions and that accumulates evidence for each direction by estimating the time difference of arrival (TDOA) over all the microphone pairs, such that all the microphones are used simultaneously and symmetrically and (ii) an audio-visual alignment method that maps 3D visual features onto 2D sound directions and onto TDOAs between microphone pairs. This allows to implicitly represent both sensing modalities into a common audiovisual coordinate frame. Using simulated as well as real data, we quantitatively assess the robustness of the method against noise and reverberations, and we compare it with several other methods. Finally, we describe a realtime implementation using the proposed technique and with a humanoid head embedding four microphones and two cameras: this enables natural human-robot interactive behavior.

show abstract

Section: B Audio-visual Associationmentioning

confidence: 99%

Section: B Audio-visual Associationmentioning

confidence: 99%

See 1 more Smart Citation

Active-speaker detection and localization with microphones and cameras embedded into a robotic head

Čech

Mittal

Deleforge

et al. 2013

2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids)

Self Cite

View full text Add to dashboard Cite

show abstract

“…In this paper, a geometric transformation [3] is therefore applied to the DOAs in order to map the source directions from spherical space to a pixel position on the image plane. Therefore, auditory DOAs, visual facial detections, and the desired system states can be treated in the same mathematical space.…”

Section: System Modelmentioning

confidence: 99%

“…Recent contributions in the audio-visual community therefore utilize features extracted from images to complement audio processing tasks and vice versa. The attention control system for mobile robots in [2,3] uses separate but parallel audio and visual processing subsystems to identify salient events. In [4], a visual tracker is used to estimate the positions and velocities of people for blind source separation.…”

Section: Introductionmentioning

confidence: 99%

Audio-visual tracking by density approximation in a sequential Bayesian filtering framework

Gebru

Evers

Naylor

et al. 2017

2017 Hands-Free Speech Communications and Microphone Arrays (HSCMA)

Self Cite

View full text Add to dashboard Cite

This paper proposes a novel audio-visual tracking approach that exploits constructively audio and visual modalities in order to estimate trajectories of multiple people in a joint state space. The tracking problem is modeled using a sequential Bayesian filtering framework. Within this framework, we propose to represent the posterior density with a Gaussian Mixture Model (GMM). To ensure that a GMM representation can be retained sequentially over time, the predictive density is approximated by a GMM using the Unscented Transform. While a density interpolation technique is introduced to obtain a continuous representation of the observation likelihood, which is also a GMM. Furthermore, to prevent the number of mixtures from growing exponentially over time, a density approximation based on the Expectation Maximization (EM) algorithm is applied, resulting in a compact GMM representation of the posterior density. Recordings using a camcorder and microphone array are used to evaluate the proposed approach, demonstrating significant improvements in tracking performance of the proposed audio-visual approach compared to two benchmark visual trackers.

show abstract