2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012)
DOI: 10.1109/humanoids.2012.6651509
Online multimodal speaker detection for humanoid robots

Abstract: In this paper we address the problem of audio-visual speaker detection. We introduce an online system working on the humanoid robot NAO. The scene is perceived with two cameras and two microphones. A multimodal Gaussian mixture model (mGMM) fuses the information extracted from the auditory and visual sensors and detects the most probable audio-visual object, e.g., a person emitting a sound, in 3D space. The system is implemented on top of a platform-independent middleware and it is able to process t…
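The fusion idea summarized in the abstract — scoring candidate audio-visual objects and keeping the most probable one — can be illustrated with a minimal sketch. The function names, the single-azimuth audio observation, and the noise parameter below are illustrative assumptions, not the paper's actual mGMM implementation:

```python
import numpy as np

def sound_likelihood(candidate_xyz, observed_azimuth, sigma=0.15):
    """Likelihood that a sound with the observed azimuth (radians) was
    emitted by a candidate at 3D position candidate_xyz (robot-centred
    frame: x forward, y left). sigma is an assumed localization noise."""
    predicted = np.arctan2(candidate_xyz[1], candidate_xyz[0])
    # wrap the angular error to [-pi, pi] before scoring
    err = np.arctan2(np.sin(observed_azimuth - predicted),
                     np.cos(observed_azimuth - predicted))
    return np.exp(-0.5 * (err / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def most_probable_speaker(candidates, observed_azimuth, priors=None):
    """Pick the visual candidate (3D head position from the stereo pair)
    that best explains the sound direction; priors are optional
    per-candidate weights, e.g. face-detector confidences."""
    priors = priors or [1.0 / len(candidates)] * len(candidates)
    scores = [w * sound_likelihood(np.asarray(c), observed_azimuth)
              for c, w in zip(candidates, priors)]
    return int(np.argmax(scores))

# two people, one ahead-left and one ahead-right; the sound comes from the left
people = [[1.0, 0.5, 0.0], [1.0, -0.5, 0.0]]
print(most_probable_speaker(people, np.arctan2(0.5, 1.0)))  # → 0
```

Wrapping the angular error keeps the score correct when the observed and predicted azimuths straddle the ±π boundary.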

Cited by 10 publications (13 citation statements)
References 17 publications
“…Similarly to the approach in [16], the proposed audio-visual fusion model relies on 3D visual features. Since the task is to localize speakers and to estimate their speaking activity status, ideally one would like to find 3D lips/mouth locations and to combine these locations with 2D sound source locations.…”
Section: B. Audio-Visual Association
confidence: 99%
“…If this is the case, then the covariance matrices would be speaker/direction specific and should be estimated online over a short period during which several directions are collected. The GMM's parameters would then be estimated via an EM procedure, e.g., by extending the model [16] from 1D to 2D sound localization. Such an approach would require enough independent samples.…”
Section: B. Audio-Visual Association
confidence: 99%
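The EM-based extension to 2D sound localization suggested in this statement can be sketched as follows: a minimal, pure-NumPy EM fit of a Gaussian mixture to simulated 2D sound directions (azimuth, elevation) from two speakers. The cluster locations, noise level, and initialization scheme are illustrative assumptions, not data or parameters from the cited works:

```python
import numpy as np

def em_gmm(X, k, iters=50):
    """Minimal EM for a k-component Gaussian mixture with full covariances.
    X is (n, d); returns (weights, means, covariances)."""
    n, d = X.shape
    # spread the initial means along the first coordinate (one per cluster
    # for well-separated data); an illustrative initialization choice
    order = np.argsort(X[:, 0])
    means = X[order[[n // (2 * k) + i * n // k for i in range(k)]]].copy()
    covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = p(component j | sample i)
        r = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            inv = np.linalg.inv(covs[j])
            det = np.linalg.det(covs[j])
            expo = -0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)
            r[:, j] = weights[j] * np.exp(expo) / np.sqrt((2 * np.pi) ** d * det)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return weights, means, covs

# simulated 2D sound directions (azimuth, elevation in radians) of two speakers
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.4, 0.0], 0.05, (100, 2)),
               rng.normal([-0.5, 0.1], 0.05, (100, 2))])
w, mu, _ = em_gmm(X, k=2)
print(np.round(np.sort(mu[:, 0]), 1))  # recovered azimuth means, ≈ [-0.5  0.4]
```

The small ridge added to each covariance keeps the matrices invertible, which matters when a component collapses onto few samples — the independence concern raised in the quoted statement.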
“…In this paper, a geometric transformation [3] is therefore applied to the DOAs in order to map the source directions from spherical space to a pixel position on the image plane. In this way, auditory DOAs, visual facial detections, and the desired system states can be treated in the same mathematical space.…”
Section: System Model
confidence: 99%
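A geometric mapping of this kind can be sketched with a standard pinhole-camera model. The intrinsic parameters below (fx, fy, cx, cy) are illustrative assumptions, not the calibration used in the cited work:

```python
import numpy as np

def doa_to_pixel(azimuth, elevation, fx=600.0, fy=600.0, cx=320.0, cy=240.0):
    """Project a sound direction of arrival onto the image plane of a
    pinhole camera (camera frame: z forward, x right, y down). The
    intrinsics fx, fy, cx, cy are illustrative, not a real calibration."""
    # unit bearing vector of the DOA; positive azimuth turns right,
    # positive elevation looks up (hence the minus sign on y)
    x = np.cos(elevation) * np.sin(azimuth)
    y = -np.sin(elevation)
    z = np.cos(elevation) * np.cos(azimuth)
    if z <= 0:
        return None  # source behind the camera: no valid pixel
    return fx * (x / z) + cx, fy * (y / z) + cy

print(doa_to_pixel(0.0, 0.0))  # straight ahead → (320.0, 240.0), the principal point
```

Once auditory directions live in pixel coordinates, they can be associated with facial detections by simple image-plane distances, as the quoted statement describes.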
“…Recent contributions in the audio-visual community therefore utilize features extracted from images to complement audio processing tasks and vice versa. The attention control system for mobile robots in [2,3] uses separate but parallel audio and visual processing subsystems to identify salient events. In [4], a visual tracker is used to estimate the positions and velocities of people for blind source separation.…”
Section: Introduction
confidence: 99%