“…Additionally, for each architecture, we tune the model parameters such as the number of CNN, RNN, and FC layers (0 to 4) and nodes per layer (in the set of [16, 32, 64, 128, 256, 512]). The input sequence length is tuned in the set of [32, 64, 128, 256, 512], the DOA and SED branch output loss weights in the set of [1, 5, 50, 500], the regularization (dropout in the set of [0, 0.1, 0.2, 0.3, 0.4, 0.5]; L1 and L2 in the set of [0, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5, 10^-6, 10^-7]), and the CNN max-pooling in the set of [2, 4, 6, 8, 16] for each layer. The best set of parameters is the one that gives the lowest SELD score on the three cross-validation splits of the dataset.…”
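The search described above can be sketched as a random search over the stated grid. This is a minimal illustration, not the authors' tuning code: the `evaluate_split` callback is a hypothetical stand-in for training the model and computing the SELD score on one cross-validation split, and the parameter names are assumptions for readability.

```python
import random

# Hyperparameter grid as listed in the excerpt.
SEARCH_SPACE = {
    "cnn_layers": [0, 1, 2, 3, 4],
    "rnn_layers": [0, 1, 2, 3, 4],
    "fc_layers": [0, 1, 2, 3, 4],
    "nodes": [16, 32, 64, 128, 256, 512],
    "seq_len": [32, 64, 128, 256, 512],
    "branch_loss_weight": [1, 5, 50, 500],   # DOA/SED output loss weights
    "dropout": [0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "l1": [0] + [10 ** -k for k in range(1, 8)],
    "l2": [0] + [10 ** -k for k in range(1, 8)],
    "max_pool": [2, 4, 6, 8, 16],            # per CNN layer
}

def sample_config(rng):
    """Draw one random configuration from the grid."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def mean_seld_score(config, evaluate_split, n_splits=3):
    """Average the SELD score over the three cross-validation splits.

    `evaluate_split(config, split)` is a hypothetical callback that would
    train the model with `config` and return its SELD score on `split`.
    """
    return sum(evaluate_split(config, s) for s in range(n_splits)) / n_splits

def random_search(evaluate_split, n_trials=20, seed=0):
    """Return the sampled config with the lowest mean SELD score (lower is better)."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=lambda c: mean_seld_score(c, evaluate_split))
```

An exhaustive grid search over this space would be far too large (millions of combinations), which is why a sampled search with a fixed trial budget is a common practical choice here.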