Regression Versus Classification for Neural Network Based Audio Source Localization

Perotin, Laureline; Défossez, Alexandre; Serizel, Romain; Guérin, Alexandre

doi:10.1109/waspaa.2019.8937277

Cited by 37 publications

(34 citation statements)

References 24 publications


“…During training of the DOAnet, pairwise Euclidean distances are computed between the M_t predicted and N_t reference DOAs, forming the distance matrix D. Euclidean distances are used instead of angular (cosine) distances, since they were found in [8], [16] to perform better during training. Note that we embed the pairwise distances in a D matrix of the maximum dimensions N_max × N_max, padding rows and columns beyond M_t, N_t with out-of-range values (i.e.…”

Section: B Differentiable Direction Of Arrival Network (DOAnet) (mentioning)

confidence: 99%
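The padded distance matrix described in the statement above can be sketched as follows. This is a minimal illustration, not the DOAnet authors' code: the function name `padded_distance_matrix`, the representation of DOAs as Cartesian unit vectors, and the default `pad_value` are all assumptions (the quoted text is truncated before it states the actual out-of-range value).

```python
import math


def padded_distance_matrix(pred, ref, n_max, pad_value=10.0):
    """Pairwise Euclidean distances between predicted and reference DOAs.

    pred: list of M_t predicted DOAs as (x, y, z) tuples (assumed format).
    ref:  list of N_t reference DOAs as (x, y, z) tuples (assumed format).
    Rows beyond M_t and columns beyond N_t keep pad_value, a hypothetical
    out-of-range constant (unit vectors are never more than 2 apart).
    """
    if len(pred) > n_max or len(ref) > n_max:
        raise ValueError("n_max must be at least the number of DOAs")
    # Start from a matrix filled entirely with the padding value.
    D = [[pad_value] * n_max for _ in range(n_max)]
    # Overwrite the top-left M_t x N_t block with real distances.
    for i, p in enumerate(pred):
        for j, r in enumerate(ref):
            D[i][j] = math.dist(p, r)
    return D


# One prediction, two references, embedded in a 3 x 3 matrix.
D = padded_distance_matrix([(1.0, 0.0, 0.0)],
                           [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
                           n_max=3)
```

Keeping the matrix at a fixed N_max × N_max size means a downstream assignment step sees the same input shape regardless of how many sources are active in a frame.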

“…A deep-learning paradigm on SSL opens up a few interesting research questions, such as basic spectrogram [8], [10] versus refined spatial [9], [11] multichannel input features, coupling the network architecture to SSL effectively [10], [14], choosing appropriate training source signals for generalization [10], [15], strong versus weak supervision [13], and posing SSL as a classification [7], [9]-[11] or regression [8], [12], [16] problem. The latter division was already present in earlier attempts of single-source deep-learning SSL, such as classification in [17] and regression in [18].…”

Section: Introduction (mentioning)

confidence: 99%

“…Classification-based SSL was the dominant paradigm until recently, where studies such as [8] brought increased attention to regression, with similar performance to classification further validated, e.g., in [16]. Regression-based SSL has its own advantages: a single regressor on DOA vectors or angles can handle the whole DOA domain for a single source with one to three outputs, estimation is continuous, and moving source scenarios are handled naturally [19], [20].…”

Section: Introduction (mentioning)

confidence: 99%
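To make the "one to three outputs" point in the statement above concrete, the sketch below converts between an angle pair and a three-component DOA vector. It is only an illustration under assumed conventions (azimuth measured in the x-y plane, elevation measured from that plane, both in radians); the function names are hypothetical and not taken from any of the cited papers.

```python
import math


def angles_to_vector(azimuth, elevation):
    """Map (azimuth, elevation) in radians to a Cartesian unit vector."""
    ce = math.cos(elevation)
    return (ce * math.cos(azimuth), ce * math.sin(azimuth), math.sin(elevation))


def vector_to_angles(v):
    """Map a raw 3-output regression vector back to (azimuth, elevation).

    The vector is normalised first, since a regressor's raw output is not
    guaranteed to have unit length.
    """
    x, y, z = v
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    azimuth = math.atan2(y, x)
    # Clamp to guard against rounding slightly outside [-1, 1].
    elevation = math.asin(max(-1.0, min(1.0, z / norm)))
    return azimuth, elevation


# A non-unit output pointing along +y decodes to azimuth pi/2, elevation 0.
az, el = vector_to_angles((0.0, 2.0, 0.0))
```

Because the output is a continuous vector rather than a grid index, the decoded direction is not limited to a fixed angular resolution.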


Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers

Adavanne, Politis, Virtanen

2021

Preprint

Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly set as a classification or a regression problem. Regression-based approaches have certain advantages over classification-based ones, such as continuous direction-of-arrival estimation of static and moving sources. However, multi-source scenarios require multiple regressors, and to date there is no clear training strategy for them that does not rely on auxiliary information such as simultaneous sound classification. We investigate end-to-end training of such methods with a technique recently proposed for video object detectors, adapted to the SSL setting. A differentiable network is constructed that can be plugged into the output of the localizer to solve the optimal assignment between predictions and references, directly optimizing the popular CLEAR-MOT tracking metrics. Results indicate large improvements over directly optimizing mean squared errors, in terms of localization error, detection metrics, and tracking capabilities.


“…Other popular input representations for machine learning-based ASL include spectro-temporal features of the audio stream (STFT, Gammatone), or the waveforms themselves [12,13]. As for the output target, DOA estimation is often cast as a multi-label classification problem, or as regression of Cartesian coordinates [14]. A drawback of classification is that the cross-entropy loss between one-hot encoded targets and predictions does not take actual angular distances into account, while direct regression of source coordinates does not support variable numbers of speakers [15].…”

Section: Introduction (mentioning)

confidence: 99%
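The drawback noted in the statement above, that cross-entropy against a one-hot target ignores angular distance, can be checked with a small numeric sketch. The setup is an assumption made purely for illustration: a hypothetical azimuth-only grid of 72 classes spaced 5 degrees apart, with a prediction that puts 90% of its probability on a single wrong class.

```python
import math

N_CLASSES = 72          # hypothetical grid: one class per 5 degrees of azimuth
STEP_DEG = 360.0 / N_CLASSES


def cross_entropy(probs, target_index):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -math.log(probs[target_index])


def angular_error_deg(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def peaked(index, peak=0.9):
    """Distribution with `peak` mass on one class, the rest spread evenly."""
    rest = (1.0 - peak) / (N_CLASSES - 1)
    probs = [rest] * N_CLASSES
    probs[index] = peak
    return probs


target = 0                       # true source at 0 degrees
near, far = 1, 36                # wrong guesses at 5 and 180 degrees

loss_near = cross_entropy(peaked(near), target)
loss_far = cross_entropy(peaked(far), target)
err_near = angular_error_deg(near * STEP_DEG, target * STEP_DEG)
err_far = angular_error_deg(far * STEP_DEG, target * STEP_DEG)
```

Here `loss_near` and `loss_far` are identical even though the angular errors are 5 and 180 degrees, whereas a regression loss on coordinates would penalise the second guess far more heavily.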

Acoustic Reflectors Localization from Stereo Recordings Using Neural Networks

Bologni, Heusdens, Martínez

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Acoustic room geometry estimation is often performed in ad hoc settings, i.e., using multiple microphones and sources distributed around the room, or assuming control over the excitation signals. We propose a fully convolutional network (FCN) that localizes reflective surfaces under the relaxed assumptions that (i) a compact array of only two microphones is available, (ii) emitter and receivers are not synchronized, and (iii) both the excitation signals and the impulse responses of the enclosures are unknown. Our FCN is trained in a supervised fashion to predict the likelihood of reflective surfaces at specific distances and directions-of-arrival (DOA). When a single reflective surface is present, up to 80% of real and virtual sources are detected, while this figure approaches 50% in rectangular rooms. Experiments on real-world recordings report similar accuracy as with artificially reverberated speech signals, validating the generalization capabilities of the framework.


“…The estimated DOA on the DNN output can be represented in a classification manner, where a class activity symbolizes an active source from the corresponding direction, or a regression manner, where a single variable represents the DOA (e.g., an angle). According to [28], both representations yield comparable results such that the output representation of the DOA is a design choice. Some of the DNNs for DOA estimation (referred to as DDNNs) are trained with directional noise signals (e.g., [40,45,52]) as this allows to generate an infinite amount of simulated training data.…”

Section: Introduction (mentioning)

confidence: 99%

Signal-Aware Broadband DOA Estimation Using Attention Mechanisms

Mack, Bharadwaj, Chakrabarty, et al.

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The direction-of-arrival (DOA) of sound sources is an essential acoustic parameter used, e.g., for multi-channel speech enhancement or source tracking. Complex acoustic scenarios consisting of sources-of-interest, interfering sources, reverberation, and noise make the estimation of the DOAs corresponding to the sources-of-interest a challenging task. Recently proposed attention mechanisms allow DOA estimators to focus on the sources-of-interest and disregard interference and noise, i.e., they are signal-aware. The attention is typically obtained by a deep neural network (DNN) from a short-time Fourier transform (STFT) based representation of a single microphone signal. Subsequently, attention has been applied as binary or ratio weighting to STFT-based microphone signal representations to reduce the impact of frequency bins dominated by noise, interference, or reverberation. The impact of attention on DOA estimators and different training strategies for attention and DOA DNNs are not yet studied in depth. In this paper, we evaluate systems consisting of different DNNs and signal processing-based methods for DOA estimation when attention is applied. Additionally, we propose training strategies for attention-based DOA estimation optimized via a DOA objective, i.e., end-to-end. The evaluation of the proposed and the baseline systems is performed using data generated with simulated and measured room impulse responses under various acoustic conditions, like reverberation times, noise, and source array distances. The best-performing systems are also evaluated using measured data. Our experiments show that DNNs used for DOA estimation are biased to the spectral source characteristics and the spectral attention distribution used during training (e.g., spectrally flat/sparse). We also show that this bias in the DOA estimator can be avoided if signal-processing methods are used in combination with attention. Overall, DOA estimation using attention in combination with signal-processing methods exhibits a far lower computational complexity than a fully DNN-based system; however, it yields comparable results.
