Supervised Contrastive Embeddings for Binaural Source Localization

Liu

2022

Preprint

Direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. Though DP-RTF fully encodes the sound spatial cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes to learn DP-RTF with deep neural networks for robust binaural sound source localization. A DP-RTF learning network is designed to regress the binaural sensor signals to a real-valued representation of DP-RTF. It consists of a branched convolutional neural network module to separately extract the inter-channel magnitude and phase patterns, and a convolutional recurrent neural network module for joint feature learning. To better exploit the speech spectra to aid the DP-RTF estimation, a monaural speech enhancement network is used to recover the direct-path spectrograms from the noisy ones. The enhanced spectrograms are stacked onto the noisy spectrograms to act as the input of the DP-RTF learning network. We train a single DP-RTF learning network using many different binaural arrays to enable the generalization of DP-RTF learning across arrays. This avoids time-consuming training data collection and network retraining for a new array, which is very useful in practical applications. Experimental results on both simulated and real-world data show the effectiveness of the proposed method for direction of arrival (DOA) estimation in noisy and reverberant environments, and a good generalization ability to unseen binaural arrays.
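The definition in the first sentence of this abstract can be illustrated with a minimal sketch. It assumes a simple free-field model (not stated in the abstract) in which the direct-path transfer function of each channel is a gain times a pure delay; the function names `direct_path_atf` and `dp_rtf` and all numeric values are hypothetical.

```python
import numpy as np

def direct_path_atf(freqs_hz, gain, delay_s):
    """Direct-path acoustic transfer function under an assumed free-field model:
    a frequency-independent gain and a pure propagation delay."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    return gain * np.exp(-2j * np.pi * freqs_hz * delay_s)

def dp_rtf(atf_ref, atf_other):
    """DP-RTF: ratio of the direct-path ATF of one channel to that of the
    reference channel, evaluated per frequency bin."""
    return atf_other / atf_ref

# Illustrative values only: channel 2 is 0.5 ms later and 20% quieter.
freqs = np.array([0.0, 500.0, 1000.0])
h1 = direct_path_atf(freqs, gain=1.0, delay_s=1.0e-3)
h2 = direct_path_atf(freqs, gain=0.8, delay_s=1.5e-3)
rtf = dp_rtf(h1, h2)

# Magnitude carries the level difference, phase carries the time difference.
magnitude = np.abs(rtf)
phase = np.angle(rtf)
```

Under this assumed model the magnitude of the ratio is the inter-channel gain ratio at every frequency, and its phase grows linearly with frequency in proportion to the inter-channel delay.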

Section: A. Deep Learning Based Sound Source Localization (mentioning)

confidence: 99%

Section: B. Enhancement of Signal Spectra and Localization Features (mentioning)

confidence: 99%

Learning Deep Direct-Path Relative Transfer Function for Binaural Sound Source Localization

Liu

2022

Preprint

“…With the development of deep learning techniques, many data-driven sound source localization works are built in a supervised manner [1]. According to the role the deep learning model plays, these methods are classified into four categories, namely signal-to-location [2], feature-to-location [3,4], spatial spectrum-to-location [5], and feature-to-feature [6,7]-based methods. Among these methods, the feature-to-feature-based method is simple and effective for improving the performance of sound source localization in noisy and reverberant environments, as it is data-driven and the extracted features can adapt to various acoustic conditions.…”

Section: Introduction (mentioning)

confidence: 99%

Enhancing direct‐path relative transfer function using deep neural network for robust sound source localization

CAAI Trans on Intel Tech

Ding

Ban

et al. 2021

This article proposes a deep neural network (DNN)-based direct-path relative transfer function (DP-RTF) enhancement method for robust direction of arrival (DOA) estimation in noisy and reverberant environments. The DP-RTF refers to the ratio between the direct-path acoustic transfer functions of the two microphone channels. First, the complex-valued DP-RTF is decomposed into the inter-channel intensity difference and sinusoidal functions of the inter-channel phase difference in the time-frequency domain. Then, the decomposed DP-RTF features from a series of temporal context frames are used to train a DNN model, which maps the DP-RTF features contaminated by noise and reverberation to the clean ones, and meanwhile provides a time-frequency (TF) weight to indicate the reliability of the mapping. The DP-RTF enhancement network can help to enhance the DP-RTF against noise and reverberation. Finally, the DOA of a sound source can be estimated by integrating the weighted matching between the enhanced DP-RTF features and the DP-RTF templates. Experimental results on simulated data show the superiority of the proposed DP-RTF enhancement network for estimating the DOA of the sound source in environments with various levels of noise and reverberation.
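The decomposition and template-matching steps described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the intensity difference is expressed in dB, works on a single frame of frequency bins rather than a full time-frequency map, uses a squared-error matching cost, and builds templates from a hypothetical free-field two-microphone model. The names `decompose_dp_rtf` and `estimate_doa`, the 0.18 m spacing, and the candidate grid are all assumptions.

```python
import numpy as np

def decompose_dp_rtf(rtf, eps=1e-12):
    """Map complex DP-RTF values to real features [IID in dB, sin(IPD), cos(IPD)].
    The dB scaling of the IID is an assumption made for this sketch."""
    rtf = np.asarray(rtf, dtype=complex)
    iid_db = 20.0 * np.log10(np.abs(rtf) + eps)
    ipd = np.angle(rtf)
    return np.stack([iid_db, np.sin(ipd), np.cos(ipd)], axis=-1)

def estimate_doa(features, weights, templates, candidate_doas):
    """Return the candidate DOA whose template best matches the features.

    features:  (F, 3) real-valued DP-RTF features, one row per frequency bin
    weights:   (F,)   reliability weight per bin
    templates: (D, F, 3) features of the D candidate directions
    """
    sq_dist = np.sum((templates - features[None]) ** 2, axis=-1)  # (D, F)
    cost = sq_dist @ weights                                      # (D,)
    return candidate_doas[int(np.argmin(cost))]

# Hypothetical templates from a free-field model with 0.18 m spacing.
freqs = np.linspace(100.0, 4000.0, 40)
candidate_doas = np.array([-90, -60, -30, 0, 30, 60, 90])
itds = 0.18 * np.sin(np.deg2rad(candidate_doas)) / 343.0              # (D,)
template_rtf = np.exp(-2j * np.pi * freqs[None, :] * itds[:, None])  # (D, F)
templates = decompose_dp_rtf(template_rtf)                            # (D, F, 3)

# Pretend the network returned the clean features of the 30-degree source.
observed = templates[4]
weights = np.ones(len(freqs))
doa = estimate_doa(observed, weights, templates, candidate_doas)
```

Using sine and cosine of the phase difference instead of the raw angle avoids the discontinuity at plus or minus pi, which is one reason a real-valued decomposition of this kind is convenient as a regression target.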

“…Deep learning has been successfully applied to the localization task recently. Under the dual-stage localization framework, a deep neural network (DNN) can be used to either extract localization features [3,4], or build the mapping from the localization features to source location [5,6]. Commonly used localization features include inter-channel time difference (ITD) [7], inter-channel phase difference (IPD) [8], inter-channel intensity difference (IID), relative transfer function (RTF) [9,10], etc.…”

Section: Introduction (mentioning)

confidence: 99%

Supervised Direct-Path Relative Transfer Function Learning for Binaural Sound Source Localization

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Liu

2021

Direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two channels. Though DP-RTF fully encodes the sound directional cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes a supervised DP-RTF learning method with deep neural networks for robust binaural sound source localization. To exploit the complementarity of single-channel spectrogram and dual-channel difference information, we first recover the direct-path magnitude spectrogram from the contaminated one using a monaural enhancement network, and then predict the DP-RTF from the dual-channel (enhanced) intensity and phase cues using a binaural enhancement network. In addition, a weighted-matching softmax training loss is designed to encourage the predicted DP-RTFs to be concentrated for the same direction and separated for different directions. Finally, the direction of arrival (DOA) of the source is estimated by matching the predicted DP-RTF with the ground truths of candidate directions. Experimental results show the superiority of our method for DOA estimation in environments with various levels of noise and reverberation.
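The abstract does not give the form of the weighted-matching softmax loss, so the following is only one plausible reading, offered as a sketch: a softmax over negative weighted squared distances between the predicted DP-RTF feature vector and the ground-truth vector of each candidate direction, trained with cross-entropy against the true direction. The function name `matching_softmax_loss`, the `scale` parameter, and the toy templates are hypothetical.

```python
import numpy as np

def matching_softmax_loss(pred, templates, true_idx, weights=None, scale=1.0):
    """Cross-entropy over a softmax of negative weighted matching distances.

    pred:      (K,)   predicted real-valued DP-RTF feature vector
    templates: (D, K) ground-truth feature vectors of D candidate directions
    true_idx:  index of the true direction
    weights:   (K,)   optional per-feature reliability weights
    """
    if weights is None:
        weights = np.ones_like(pred)
    dist = ((templates - pred[None, :]) ** 2) @ weights  # (D,)
    logits = -scale * dist
    logits = logits - logits.max()                       # numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[true_idx]

# Toy example: three well-separated candidate directions.
templates = np.eye(3)
pred = templates[0].copy()
loss_correct = matching_softmax_loss(pred, templates, true_idx=0)
loss_wrong = matching_softmax_loss(pred, templates, true_idx=1)
```

A loss of this shape is small when the prediction sits close to the true direction's template and far from the others, which matches the stated goal of concentrating predictions for the same direction and separating them across directions.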