In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, sound event detection (SED) is performed as a multi-label classification task on each time-frame, producing temporal activity for all sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with their respective sound event labels and to track this association over time. The method uses the phase and magnitude components of the spectrogram, calculated on each audio channel, separately as features, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular-array-format datasets with different numbers of overlapping sound events in anechoic, reverberant, and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic, applicable to any array structure, and robust to unseen DOA values, reverberation, and low-SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets than the best baseline, and this recall was significantly better than the best baseline for higher numbers of overlapping sound events.
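As a rough illustration of the two-branch architecture described above, the following is a minimal sketch (not the authors' exact SELDnet configuration; the layer sizes, pooling, and class count are assumptions) of a CRNN that maps multichannel spectrogram frames to per-frame SED probabilities and Cartesian DOA estimates in parallel:

```python
# Minimal sketch: CRNN with two parallel per-frame outputs, multi-label SED and
# 3D Cartesian DOA regression. Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class SELDSketch(nn.Module):
    def __init__(self, n_channels=8, n_freq_bins=64, n_classes=11, rnn_size=128):
        super().__init__()
        # Convolutional front-end over (channels, time, frequency)
        self.conv = nn.Sequential(
            nn.Conv2d(n_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),          # pool only along frequency, keep time resolution
        )
        self.rnn = nn.GRU(64 * (n_freq_bins // 4), rnn_size,
                          batch_first=True, bidirectional=True)
        # Branch 1: per-frame multi-label sound event detection
        self.sed_head = nn.Linear(2 * rnn_size, n_classes)
        # Branch 2: per-frame x, y, z DOA coordinates for each class
        self.doa_head = nn.Linear(2 * rnn_size, 3 * n_classes)

    def forward(self, x):                  # x: (batch, channels, time, freq)
        h = self.conv(x)                   # (batch, 64, time, freq // 4)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.rnn(h)                 # (batch, time, 2 * rnn_size)
        sed = torch.sigmoid(self.sed_head(h))   # event activity probabilities
        doa = torch.tanh(self.doa_head(h))      # Cartesian DOA coordinates in [-1, 1]
        return sed, doa
```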
Speech separation algorithms face the difficult task of producing a high degree of separation without introducing unwanted artifacts. The time-frequency (T-F) masking technique applies a real-valued (or binary) mask on top of the signal's spectrum to filter out unwanted components. The practical difficulty lies in the mask estimation: masks engineered purely for separation performance often introduce musical-noise artifacts into the separated signal, which lowers the perceptual quality and intelligibility of the output. Microphone arrays have long been studied for distant-speech processing. This work uses a feed-forward neural network to map a microphone array's spatial features into a T-F mask. A Wiener filter is used as the desired mask for training the neural network on speech examples in a simulated setting. The T-F masks predicted by the neural network are combined to obtain an enhanced separation mask that exploits the information about interference between all sources. The final mask is applied to the delay-and-sum beamformer (DSB) output. The algorithm's objective separation capability, together with the intelligibility of the separated speech, is tested with speech recorded from distant talkers in two rooms at two distances. The results show improvements in an instrumental intelligibility measure and in frequency-weighted SNR over a complex-valued non-negative matrix factorization (CNMF) source separation approach, a spatial sound source separation method, and conventional beamformers such as the DSB and the minimum variance distortionless response (MVDR) beamformer.
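The following is a minimal sketch of the mask-prediction step described above, assuming per-T-F-bin spatial features (e.g. inter-channel phase and level cues) as input; the network size, feature layout, and helper names are assumptions, not the authors' exact configuration:

```python
# Minimal sketch: feed-forward network regresses a real-valued T-F mask towards a
# Wiener-filter training target; the predicted mask is then applied to the DSB output.
import torch
import torch.nn as nn

class MaskMLP(nn.Module):
    def __init__(self, n_features, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # mask value in [0, 1]
        )

    def forward(self, feats):                     # feats: (n_bins, n_features)
        return self.net(feats).squeeze(-1)        # (n_bins,) predicted mask

def train_step(model, optimizer, feats, wiener_target):
    """One supervised step towards the Wiener-filter mask."""
    mask = model(feats)
    loss = nn.functional.mse_loss(mask, wiener_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Applying the predicted mask to the DSB output spectrogram (complex STFT bins):
# enhanced = mask.reshape(dsb_spectrogram.shape) * dsb_spectrogram
```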
In this paper we propose a method for the separation of moving sound sources. The method first tracks the sources, then estimates the source spectrograms using multichannel non-negative matrix factorization (NMF), and finally extracts the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources represented by spatial covariance matrices (SCMs), and provide update equations that optimize the model parameters by minimizing the squared Frobenius norm. The SCMs of the model are obtained from the estimated directions of arrival of the tracked sources at each time frame. The evaluation uses established objective separation criteria and real recordings of two and three simultaneously moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of the other evaluated blind approaches on all measured quantities. Additionally, we evaluate the method's susceptibility to tracking errors by comparing against the separation quality achieved with annotated ground-truth source trajectories.
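For orientation, here is a minimal sketch of the single-channel building blocks assumed in the description above: NMF with multiplicative updates minimizing the squared Frobenius norm, followed by Wiener-filter extraction of one source's contribution from the mixture. The full time-varying SCM model of the paper is not reproduced here.

```python
# Minimal sketch: Frobenius-norm NMF (Lee-Seung multiplicative updates) and
# single-channel Wiener-filter extraction. All shapes and names are illustrative.
import numpy as np

def nmf_frobenius(V, n_components, n_iter=200, eps=1e-12):
    """Factorize a magnitude spectrogram V (freq x time) as W @ H."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)      # standard multiplicative updates
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def wiener_extract(mixture_stft, source_model, all_models, eps=1e-12):
    """Single-channel Wiener filtering: weight mixture bins by the source's share."""
    mask = source_model / (all_models + eps)
    return mask * mixture_stft
```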
This paper proposes a new spatial audio coding (SAC) method based on parametrizing multichannel audio by sound objects using non-negative tensor factorization (NTF). The spatial parameters are estimated using a perceptually motivated NTF model and are used for upmixing a downmixed and encoded mixture signal. The performance of the proposed coding is evaluated with listening tests, which show it to be on a par with conventional SAC methods. Additionally, the proposed coding enables controlling the upmix content through meaningful sound objects.
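As a hypothetical illustration of the parametric upmixing step, the sketch below redistributes a mono downmix spectrogram to output channels using per-object power spectrograms and per-object channel gains, as an NTF model might provide; the function name, shapes, and procedure are assumptions, not the paper's exact decoder:

```python
# Minimal, hypothetical sketch of object-based parametric upmixing.
import numpy as np

def parametric_upmix(downmix_stft, object_power, channel_gains, eps=1e-12):
    """
    downmix_stft  : (freq, time) complex downmix spectrogram
    object_power  : (objects, freq, time) estimated object power spectrograms
    channel_gains : (channels, objects) spatial gains estimated by the NTF model
    returns       : (channels, freq, time) upmixed complex spectrograms
    """
    # Wiener-like share of each object in the downmix at every T-F bin
    total = object_power.sum(axis=0) + eps
    shares = object_power / total                       # (objects, freq, time)
    object_stfts = shares * downmix_stft[None, :, :]    # distribute the downmix
    # Pan each object to the output channels with its estimated gains
    return np.einsum('co,oft->cft', channel_gains, object_stfts)
```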