2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2019
DOI: 10.1109/waspaa.2019.8937277

Regression Versus Classification for Neural Network Based Audio Source Localization

Abstract: We compare the performance of regression and classification neural networks for single-source direction-of-arrival estimation. Since the output space is continuous and structured, regression seems more appropriate. However, classification on a discrete spherical grid is widely believed to perform better and is predominantly used in the literature. For regression, we propose two ways to account for the spherical geometry of the output space based either on the angular distance between spherical coordinates or o…
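
As an illustration of the angular distance between spherical coordinates mentioned in the abstract, the following is a minimal sketch rather than the paper's loss function. It assumes azimuth and elevation in radians, with elevation measured from the horizontal plane; the function name `angular_distance` is hypothetical.

```python
import math

def angular_distance(az1, el1, az2, el2):
    """Great-circle angle in radians between two directions.

    Assumed convention (not taken from the paper): azimuth and elevation
    in radians, elevation measured from the horizontal plane.
    """
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cos_angle)))

# Two directions on the horizon, a quarter turn apart.
quarter_turn = angular_distance(0.0, 0.0, math.pi / 2, 0.0)

# At the pole the azimuth is irrelevant, so the distance should vanish.
at_pole = angular_distance(0.0, math.pi / 2, 1.0, math.pi / 2)
```

Unlike a plain difference of angle pairs, this measure stays well behaved near the poles, where very different azimuths describe nearly the same direction.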

citations
Cited by 37 publications
(34 citation statements)
references
References 24 publications
“…During training of the DOAnet, pairwise Euclidean distances are computed between the M_t predicted and N_t reference DOAs, forming the distance matrix D. Euclidean distances are used instead of angular (cosine) distances, since they were found in [8], [16] to perform better during training. Note that we embed the pairwise distances in a D matrix of the maximum dimensions N_max × N_max, padding rows and columns beyond M_t, N_t with out-of-range values (i.e.…”
Section: B Differentiable Direction Of Arrival Network (DOAnet)
Citation type: mentioning
confidence: 99%
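
The padded distance matrix described in this statement can be sketched as follows. This is an illustrative reading, not the citing paper's implementation: the function name, the row/column orientation (rows for predictions, columns for references), and the `pad_value` parameter are assumptions. The quoted text is truncated before it gives the actual out-of-range value, so the caller must supply one.

```python
import math

def padded_distance_matrix(predicted, reference, n_max, pad_value):
    """Build an n_max x n_max matrix of pairwise Euclidean distances.

    predicted, reference: lists of Cartesian DOA vectors (same dimension).
    Entries outside the len(predicted) x len(reference) block keep pad_value,
    a hypothetical out-of-range constant chosen by the caller.
    """
    if len(predicted) > n_max or len(reference) > n_max:
        raise ValueError("more DOAs than n_max")
    D = [[pad_value] * n_max for _ in range(n_max)]
    for i, p in enumerate(predicted):
        for j, r in enumerate(reference):
            D[i][j] = math.dist(p, r)  # Euclidean distance (Python 3.8+)
    return D

# Two predicted DOAs, one reference DOA, padded to 3 x 3.
D = padded_distance_matrix(
    predicted=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    reference=[(1.0, 0.0, 0.0)],
    n_max=3,
    pad_value=10.0,  # assumed: any value above 2, the largest distance between unit vectors
)
```

A fixed matrix size keeps the shape of D constant across frames even when the number of predicted or reference DOAs changes.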
“…A deep-learning paradigm on SSL opens up a few interesting research questions, such as basic spectrogram [8], [10] versus refined spatial [9], [11] multichannel input features, coupling the network architecture to SSL effectively [10], [14], choosing appropriate training source signals for generalization [10], [15], strong versus weak supervision [13], and posing SSL as a classification [7], [9]-[11] or regression [8], [12], [16] problem. The latter division was already present in earlier attempts of single-source deep-learning SSL, such as classification in [17] and regression in [18].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Other popular input representations for machine learning-based ASL include spectro-temporal features of the audio stream (STFT, Gammatone), or the waveforms themselves [12,13]. As for the output target, DOA estimation is often cast as a multi-label classification problem, or as regression of Cartesian coordinates [14]. A drawback of classification is that the cross-entropy loss between one-hot encoded targets and predictions does not take actual angular distances into account, while direct regression of source coordinates does not support variable numbers of speakers [15].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
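
The drawback of cross-entropy noted in this statement can be seen in a small numerical sketch. The four-point azimuth grid and the two probability vectors below are made-up illustrative values, not data from any of the cited papers.

```python
import math

def cross_entropy(probs, true_index):
    """Cross-entropy against a one-hot target: only the true class matters."""
    return -math.log(probs[true_index])

def circular_error_deg(a, b):
    """Absolute angular difference in degrees on a circle."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

grid_deg = [0.0, 90.0, 180.0, 270.0]  # hypothetical azimuth grid
true_index = 0                        # source at 0 degrees

near_miss = [0.1, 0.8, 0.05, 0.05]    # peak 90 degrees from the source
far_miss = [0.1, 0.05, 0.8, 0.05]     # peak 180 degrees from the source

loss_near = cross_entropy(near_miss, true_index)
loss_far = cross_entropy(far_miss, true_index)

def peak_error(probs):
    """Angular error of the most probable grid direction."""
    peak = max(range(len(probs)), key=probs.__getitem__)
    return circular_error_deg(grid_deg[peak], grid_deg[true_index])

err_near = peak_error(near_miss)
err_far = peak_error(far_miss)
```

Both predictions assign the same probability to the true class, so their losses are identical even though one peak is twice as far from the source as the other.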
“…The estimated DOA on the DNN output can be represented in a classification manner, where a class activity symbolizes an active source from the corresponding direction, or a regression manner, where a single variable represents the DOA (e.g., an angle). According to [28], both representations yield comparable results such that the output representation of the DOA is a design choice. Some of the DNNs for DOA estimation (referred to as DDNNs) are trained with directional noise signals (e.g., [40,45,52]) as this allows to generate an infinite amount of simulated training data.…”
Section: Introductionmentioning
confidence: 99%