Interspeech 2018
DOI: 10.21437/interspeech.2018-1269

Joint Localization and Classification of Multiple Sound Sources Using a Multi-task Neural Network

Abstract: We propose a novel multi-task neural network-based approach for joint sound source localization and speech/non-speech classification in noisy environments. The network takes the raw short-time Fourier transform as input and outputs likelihood values for the two tasks, which are used for the simultaneous detection, localization and classification of an unknown number of overlapping sound sources. Tested with real recorded data, our method achieves significantly better performance in terms of speech/non-speech c…
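The abstract describes a network that consumes the raw STFT of the microphone signals. A minimal sketch of how such an input tensor might be prepared is below; the frame length, hop size, and sample rate are illustrative assumptions, not values taken from the paper, and the real/imaginary channel stacking is one common convention for feeding a complex spectrogram to a CNN.

```python
import numpy as np

def stft_frames(signal, frame_len=512, hop=256):
    """Naive STFT: Hann-windowed frames -> complex spectra (frames, freq_bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

# Hypothetical input: 1 s of noise at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
sig = rng.standard_normal(16000)

Z = stft_frames(sig)
# Stack real and imaginary parts as two input channels for a CNN.
x = np.stack([Z.real, Z.imag])
print(x.shape)  # (2, 61, 257): channels x frames x frequency bins
```

A multi-task network would then map this tensor to two output heads, one producing per-direction localization likelihoods and one producing speech/non-speech likelihoods, as the abstract outlines.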

Cited by 32 publications (19 citation statements). References 17 publications.
“…They have been used with various inputs: binaural features [24], GCC features [25], the eigenvectors of the spatial covariance matrix [26], raw short-time Fourier transform (STFT) of signals [27]-[29], including for Ambisonics signals in [29]. Different architectures have been tested: feed-forward neural networks [24], convolutional neural networks (CNNs) [27], [30], deep residual networks [31], convolutional and recurrent networks (CRNNs) [29]. Yet, most of these methods have only been evaluated in simulated environments similar to the training conditions, which is not sufficient to verify their generalization to real-life applications.…”
Section: Introduction (mentioning)
confidence: 99%
“…Most often, the number of the sound sources desired to be localized is known or set in advance, although localization of an a-priori unknown number of sources was also demonstrated (Cao et al 2021; Chazan et al 2019; He et al 2018b).…”
Section: Learning-based Sound Source Localization (mentioning)
confidence: 99%
“…Such ideal conditions hardly hold true in real-world applications and usually require special treatments [5,6]. Data-driven methods, and in particular deep learning, have recently outperformed classical signal-processing methods for various audio tasks [7,8] including SSL [9,10].…”
Section: Introduction (mentioning)
confidence: 99%