SRP-DNN: Learning Direct-Path Phase Difference for Multiple Moving Sound Source Localization

Yang, Bing; Liu, Hong; Li, Xiaofei

doi:10.1109/icassp43922.2022.9746624

Cited by 12 publications

(10 citation statements)

References 25 publications


“…
Methods | Year | Input Feature | Target
[9] | 2021 | Mag + Phase | Spatial spectrum regression
[10] | 2021 | Intensity vector | Multi-class location classification
[11] | 2021 | Mag + Phase | DP-RTF regression (2-channel)
[12] | 2021 | SRP-PHAT Spectrogram | Location regression
[13] | 2022 | SRP-PHAT Spectrogram | Location regression
[14] | 2022 | STFT Coefficients | Spatial spectrum regression
[15] | 2022 | Mag + IPD | Multi-track spatial spectrum regression
[16] | 2022 | Mag + Phase | Mixed DP-IPD regression
[17] | 2022 | GCC-PHAT + Array Geometry | Location classification (constant-channel)
[18] | 2023 | MFCC and Mel features | Multi-class location classification
[19] | 2023 | SRP-PHAT Spectrogram | Location regression
[20] | 2023 | STFT Coefficients | DP-IPD regression
Proposed | - | STFT Coefficients | Multi-track DP-IPD regression
Additional columns: Multi-source (vs. Single-source), Frame-wise (vs. Chunk-wise), Variable-array (vs. Fixed-array)

…array or uses a fixed microphone array. In [17], by also taking as input the microphone array geometry along the localization feature to the network, the network can perform SSL for variable arrays.…”

Section: Methods Year (mentioning)

confidence: 99%

“…Various network architectures have been adopted for SSL, among which convolutional neural networks (CNN) [9], [13], [14], [19] and convolutional recurrent neural networks (CRNN) [11], [16], [18] are the most commonly used networks. These networks are all designed to process all the frequencies together.…”

Section: Related Work, A. Deep Learning Based Sound Source Localization (mentioning)

confidence: 99%

“…These networks are all designed to process all the frequencies together. The network input can be in the signal level, such as the time-domain signal [33], the STFT coefficients [14], [20] or the magnitude and phase of STFT coefficients [9], [11], [16], or in the feature level, such as IPD, IID, the generalized cross-correlation (GCC) function [34]-[36] and noisy spatial spectrum [12], [13], [19].…”

Section: Related Work, A. Deep Learning Based Sound Source Localization (mentioning)

confidence: 99%

“…According to the learning target, SSL methods are classified as feature/location regression or location classification methods. Feature/location regression methods estimate the localization feature (such as DP-RTF, DP-IPD and interchannel time difference (ITD)) [11], [16], [20], [36]-[39] or directly estimate source location [9], [12]-[14], [19] from the noisy signal or noisy localization features. Most works output the feature/location for one source, and few works study how to extend feature/location regression to multiple sources.…”

Section: Related Work, A. Deep Learning Based Sound Source Localization (mentioning)

confidence: 99%


Enhancing direct‐path relative transfer function using deep neural network for robust sound source localization

Yang, Ding, Ban, et al. 2021

CAAI Trans on Intel Tech


This article proposes a deep neural network (DNN)-based direct-path relative transfer function (DP-RTF) enhancement method for robust direction of arrival (DOA) estimation in noisy and reverberant environments. The DP-RTF refers to the ratio between the direct-path acoustic transfer functions of the two microphone channels. First, the complex-valued DP-RTF is decomposed into the inter-channel intensity difference and sinusoidal functions of the inter-channel phase difference in the time-frequency domain. Then, the decomposed DP-RTF features from a series of temporal context frames are used to train a DNN model, which maps the DP-RTF features contaminated by noise and reverberation to the clean ones and also provides a time-frequency (TF) weight indicating the reliability of the mapping. The network thereby enhances the DP-RTF against noise and reverberation. Finally, the DOA of a sound source is estimated by integrating the weighted matching between the enhanced DP-RTF features and the DP-RTF templates. Experimental results on simulated data show the superiority of the proposed DP-RTF enhancement network for estimating the DOA of the sound source in environments with various levels of noise and reverberation.
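The decomposition described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the free-field transfer-function model, the dB scaling of the intensity difference, and all function names (`free_field_transfer`, `decompose_dp_rtf`) are assumptions made for illustration only.

```python
import cmath
import math


def free_field_transfer(distance_m, freq_hz, speed_of_sound=343.0):
    """Hypothetical direct-path model: 1/d attenuation plus a pure delay."""
    delay_s = distance_m / speed_of_sound
    return cmath.exp(-2j * math.pi * freq_hz * delay_s) / distance_m


def decompose_dp_rtf(h_ref, h_other):
    """Split the DP-RTF (ratio of two direct-path transfer functions) into an
    intensity difference (in dB, an assumed scaling) and the sine and cosine
    of the inter-channel phase difference."""
    rtf = h_other / h_ref
    iid_db = 20.0 * math.log10(abs(rtf))
    ipd = cmath.phase(rtf)  # principal value in (-pi, pi]
    return iid_db, math.sin(ipd), math.cos(ipd)


# Two microphones at 1.00 m and 1.05 m from the source, one 1 kHz bin.
h1 = free_field_transfer(1.00, 1000.0)
h2 = free_field_transfer(1.05, 1000.0)
iid_db, sin_ipd, cos_ipd = decompose_dp_rtf(h1, h2)
print(iid_db, sin_ipd, cos_ipd)
```

Using the sine and cosine rather than the raw phase keeps the feature continuous where the phase wraps around at plus or minus pi, which is one plausible reason for the sinusoidal encoding mentioned in the abstract.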


“…Recent work also explored distributed microphone arrays [1]. However, they did not satisfy the above goals: they were evaluated in simulated or strongly constrained environments [22][23][24][25], required exact microphone positions [26][27][28][29], used wired setups to achieve synchronization [26, 30, 31], localized only 1-2 speakers [31][32][33][34][35], or assumed a priori knowledge about the number of speakers [36][37][38].…”

Section: Speech Separation and 2D Localization (mentioning)

confidence: 99%

Creating speech zones with self-distributing acoustic swarms

Itani, Chen, Yoshioka, et al. 2023

Nat Commun


Imagine being in a crowded room with a cacophony of speakers and having the ability to focus on or remove speech from a specific 2D region. This would require understanding and manipulating an acoustic scene, isolating each speaker, and associating a 2D spatial context with each constituent speech. However, separating speech from a large number of concurrent speakers in a room into individual streams and identifying their precise 2D locations is challenging, even for the human brain. Here, we present the first acoustic swarm that demonstrates cooperative navigation with centimeter-resolution using sound, eliminating the need for cameras or external infrastructure. Our acoustic swarm forms a self-distributing wireless microphone array, which, along with our attention-based neural network framework, lets us separate and localize concurrent human speakers in the 2D space, enabling speech zones. Our evaluations showed that the acoustic swarm could localize and separate 3-5 concurrent speech sources in real-world unseen reverberant environments with median and 90-percentile 2D errors of 15 cm and 50 cm, respectively. Our system enables applications like mute zones (parts of the room where sounds are muted), active zones (regions where sounds are captured), multi-conversation separation and location-aware interaction.
