2021
DOI: 10.1109/lsp.2021.3092959

Three-Dimensional Speaker Localization: Audio-Refined Visual Scaling Factor Estimation

Abstract: Neither a monocular RGB camera nor a small-size microphone array is capable of accurate three-dimensional (3D) speaker localization on its own. By taking advantage of accurate visual object detection and complementary audio-visual sensor fusion, we formulate the 3D speaker localization problem as a visual scaling factor estimation problem. As a result, we effectively reduce traditional audio-only 3D speaker localization from an exhaustive grid search to a one-dimensional (1D) optimization problem.…
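The abstract only sketches the formulation, but the core idea can be illustrated: a monocular detector fixes a bearing (unit ray) toward the speaker, so the 3D position is p(s) = s·d for an unknown scaling factor s, and audio from a small microphone array refines s with a 1D search. The sketch below is not the authors' implementation; the array geometry, the SRP-PHAT-style scoring function srp_phat_score, and the search routine refine_scaling_factor are illustrative assumptions.

```python
# Minimal sketch, assuming: a known camera ray toward the detected speaker,
# a small microphone array with known geometry, and an SRP-PHAT-style audio
# score used to refine the visual scaling factor (depth) along that ray.
import numpy as np

C = 343.0   # speed of sound (m/s)
FS = 16000  # sampling rate (Hz), assumed

def srp_phat_score(position, mic_positions, mic_signals, fs=FS, c=C):
    """Steered-response power with PHAT weighting at a candidate 3D position."""
    n = mic_signals.shape[1]
    spectra = np.fft.rfft(mic_signals, axis=1)
    spectra /= np.abs(spectra) + 1e-12                 # PHAT whitening
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    delays = np.linalg.norm(mic_positions - position, axis=1) / c
    # Compensate each channel's propagation delay and sum coherently.
    steered = spectra * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return float(np.abs(steered.sum(axis=0)).sum())

def refine_scaling_factor(ray_dir, mic_positions, mic_signals,
                          s_min=0.5, s_max=6.0, n_grid=200):
    """1D search over the visual scaling factor s along the camera ray."""
    ray_dir = ray_dir / np.linalg.norm(ray_dir)
    candidates = np.linspace(s_min, s_max, n_grid)
    scores = [srp_phat_score(s * ray_dir, mic_positions, mic_signals)
              for s in candidates]
    s_best = candidates[int(np.argmax(scores))]
    return s_best, s_best * ray_dir                    # depth scale and 3D position
```

With ray_dir obtained by back-projecting the detected face's pixel coordinates through known camera intrinsics, the returned s_best supplies the missing depth along the ray, which is what turns the 2D detection into a 3D position estimate and replaces a full 3D grid search with a single scalar optimization.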

Cited by 6 publications (2 citation statements) · References 25 publications
“…Given the increased availability of easy-to-deploy audio and video sensors and the improvements in computing facilities, there has been notable growth in recent years in the number of proposals for tracking multiple speakers (Multiple-Object Tracking, MOT) in smart spaces by combining audio and video information [17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]. Qian et al. in [17] conducted an extensive review of the state of the art in audiovisual speaker tracking.…”
Section: Previous Work
confidence: 99%
“…The most commonly used approach is based on an observation model that preserves the Bayesian scheme. There are many examples of face detectors based on deep learning [25, 26, 27, 28, 29, 31], although Siamese networks have also been used to generate measures of particle similarity to previous reference images of each target [32, 39], as have fusion models based on the attention mechanism [32]. We found fewer proposals with end-to-end trained audiovisual solutions, as in [24, 40] for object tracking, in which visual and auditory inputs are fused by an added fusion layer.…”
Section: Previous Work
confidence: 99%
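The observation update this citing work describes (particle weights driven by appearance similarity inside a Bayesian tracking scheme) can be sketched as follows; the function name, the exponential likelihood, and the sigma parameter are illustrative assumptions, not details taken from the cited works.

```python
# Minimal sketch, assuming each particle is a candidate speaker location whose
# appearance similarity (e.g., from a face detector or Siamese embedding, in [0, 1])
# has already been computed for the particle's projected image patch.
import numpy as np

def update_particle_weights(weights, similarity_scores, sigma=0.2):
    """Bayesian observation update: scale prior weights by a likelihood derived
    from per-particle appearance similarity, then renormalize."""
    likelihood = np.exp(-(1.0 - similarity_scores) / sigma)  # higher similarity -> higher likelihood
    weights = weights * likelihood
    return weights / weights.sum()
```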