2018 IEEE International Conference on Robotics and Automation (ICRA) 2018
DOI: 10.1109/icra.2018.8461267

Deep Neural Networks for Multiple Speaker Detection and Localization

Abstract: We propose to use neural networks for simultaneous detection and localization of multiple sound sources in human-robot interaction. In contrast to conventional signal processing techniques, neural network-based sound source localization methods require fewer strong assumptions about the environment. Previous neural network-based methods have focused on localizing a single sound source and do not extend to the detection and localization of multiple sources. In this paper, we thus propose a likeliho…

Cited by 148 publications (175 citation statements)
References: 17 publications
“…All these methods estimate DOAs for static point sources and were shown to perform equally or better than the parametric methods in reverberant scenarios. Further, methods [4,18,20,25] proposed to simultaneously detect DOAs of overlapping sound events by estimating the number of active sources from the data itself. Most methods used a classification approach, thereby estimating the source presence likelihood at a fixed set of angles, while [22,23] used a regression approach and let the DNN produce continuous output.…”
Section: B Sound Source Localization (mentioning)
confidence: 99%
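
The classification approach mentioned in the statement above (a source presence likelihood at a fixed set of angles, with the number of active sources inferred from the data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the method of any cited paper: the 1° azimuth grid, the synthetic two-bump likelihood standing in for a network output, the hypothetical helper names `circular_distance_deg` and `detect_sources`, and the threshold and window values.

```python
import numpy as np


def circular_distance_deg(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = np.abs(a - b) % 360.0
    return np.minimum(d, 360.0 - d)


def detect_sources(likelihood, threshold=0.5, window=5):
    """Return grid indices that exceed `threshold` and are strictly larger
    than every neighbour within `window` bins on the circular grid.

    The number of returned indices is the estimated number of sources.
    Plateaus of exactly equal values are ignored in this sketch.
    """
    likelihood = np.asarray(likelihood, dtype=float)
    n = likelihood.size
    peaks = []
    for i in range(n):
        if likelihood[i] <= threshold:
            continue
        # Wrap around the grid so that 359 deg and 0 deg are neighbours.
        neighbours = [likelihood[(i + k) % n]
                      for k in range(-window, window + 1) if k != 0]
        if likelihood[i] > max(neighbours):
            peaks.append(i)
    return peaks


# Synthetic stand-in for a per-angle likelihood (assumed, not real data):
# two overlapping speakers at 40 deg and 200 deg on a 1-degree grid.
angles_deg = np.arange(360)
likelihood = (
    1.0 * np.exp(-0.5 * (circular_distance_deg(angles_deg, 40) / 5.0) ** 2)
    + 0.8 * np.exp(-0.5 * (circular_distance_deg(angles_deg, 200) / 5.0) ** 2)
)

detected = [int(angles_deg[i]) for i in detect_sources(likelihood)]
print(detected)  # prints [40, 200]
```

A regression-style model would instead output continuous angles directly, so no grid, threshold, or peak picking would be involved.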
“…Most of the methods estimated full azimuth ('Full' in Table I) using microphones mounted on a robot, circular and distributed arrays, while the rest of the methods used linear arrays thereby estimating only the azimuth angles in a range of 180°. Although few of the existing methods estimated the azimuth and elevation jointly [24,25], most of them estimated only the azimuth angle [1][2][3][4][17][18][19][20]. In particular, we studied the joint estimation of azimuth and elevation angles in [25], this was enabled by the use of Ambisonic signals (FOA) obtained using a spherical array.…”
Section: B Sound Source Localization (mentioning)
confidence: 99%
“…An early work in multimodal ASD used TDNN [7], and there has been significant recent work using DNN-based ASD (e.g., [8], [9], [10], [11], [12], [13], [14]). There is now a large dataset created for this task [15] with an ASD competition [10].…”
Section: Related Work (mentioning)
confidence: 99%
“…Finally, in what respect to the experimental setup, most works use simulated data either for training or for training and testing [44][45][46][47][48][49][50][51][52][54][55][56][57][58][59], usually by convolving clean (anechoic) speech with impulse responses (room, head related, or DOA related (azimuth, elevation)). Only some of them actually face real recordings [44,45,53,55,56], which in our opinion is a must to be able to assess the actual impact of the proposals in real conditions. So, in this paper we describe, for the first time in the literature to the best of our knowledge, a CNN architecture in which we directly exploit the raw acoustic signal to be provided to the neural network, with the objective of directly estimating the three dimensional position of an acoustic source in a given environment.…”
Section: State Of the Art (mentioning)
confidence: 99%
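
The simulation practice described in the last statement, convolving clean (anechoic) speech with impulse responses, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the setup of any cited work: white noise stands in for speech, the two impulse responses are toy examples (a direct path plus one echo), and `simulate_array_recording` is a hypothetical helper name.

```python
import numpy as np


def simulate_array_recording(dry_signal, impulse_responses):
    """Convolve one dry (anechoic) signal with one impulse response per
    microphone and return an array of shape (n_mics, n_samples + ir_len - 1).
    """
    dry_signal = np.asarray(dry_signal, dtype=float)
    impulse_responses = np.atleast_2d(np.asarray(impulse_responses, dtype=float))
    # np.convolve defaults to mode="full", which keeps the reverberant tail.
    return np.stack([np.convolve(dry_signal, ir) for ir in impulse_responses])


rng = np.random.default_rng(0)
dry = rng.standard_normal(1600)  # noise used as a stand-in for speech

# Toy impulse responses (assumed): the direct path arrives 4 samples later
# at microphone 1 than at microphone 0, followed by one weaker echo.
irs = np.zeros((2, 64))
irs[0, 3] = 1.0
irs[1, 7] = 1.0
irs[:, 40] = 0.3

mics = simulate_array_recording(dry, irs)
print(mics.shape)  # prints (2, 1663)
```

Measured or simulated room impulse responses would replace the toy `irs` array in practice; the inter-microphone delay encoded in them is what carries the direction cue.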