A generalized network based on multi-scale densely connection and residual attention for sound source localization and detection

Hu, Ying; Sun, Xinghao; He, Liang; Huang, Hao

doi:10.1121/10.0009671

Cited by 3 publications

(1 citation statement)

References 29 publications

“…Various sound events exhibit the diversity of duration and frequency distribution range [27]. Designing an effective feature extractor is necessary to obtain richer spatial information for the SSLD task.…”

Section: Global-local Feature Extractor (mentioning)

confidence: 99%

GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration

Ma, Hu, et al. 2024

J AUDIO SPEECH MUSIC PROC.

Self Cite

The polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and detect their corresponding directions of arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on the convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. The single-scale convolution kernels in a CRNN fail to adequately extract the multi-scale features of sound events, which have diverse time-frequency characteristics. As a result, the extracted features lack the fine-grained information that helps localize sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), in which the global-local feature (GLF) extractor extracts multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and a multi-scale feature extraction (MSFE) module, and the local feature extraction (LFE) unit captures detailed information. In addition, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task 3 in the DCASE 2021 and 2022 Challenges, we compared our proposed GLFER-Net with six and four SSLD methods, respectively. The results show that GLFER-Net achieves competitive performance. The modules we designed are verified to be effective through a series of ablation experiments and visualization analyses.
