Neural Network Adaptation and Data Augmentation for Multi-Speaker Direction-of-Arrival Estimation

He, Weipeng; Motlíček, Petr; Odobez, Jean-Marc

doi:10.1109/taslp.2021.3060257

Cited by 36 publications

(28 citation statements)

References 41 publications

“…M is the number of azimuth directions, here M = 181. d̂_i and d_i are the predicted and ground-truth DOA coding of the target speaker. Based on the likelihood-based coding in [21], the desired ground-truth values d_i are defined as follows:…”

Section: End-to-end Training (mentioning)

confidence: 99%
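
The quoted statement stops before the actual definition of the ground-truth coding, so the following is only a minimal sketch of what a likelihood-style DOA coding can look like. Only M = 181 comes from the quote. The 1-degree grid from 0 to 180 degrees, the Gaussian-shaped peak, the width `sigma_deg`, and the function name `likelihood_coding` are assumptions made for illustration, not the citing paper's definition.

```python
import math

M = 181  # number of azimuth directions, as stated in the quoted passage


def likelihood_coding(true_azimuth_deg, sigma_deg=8.0, num_directions=M):
    """Return a length-M vector that peaks at the true azimuth.

    Assumptions (not from the source): directions are 0..180 degrees in
    1-degree steps, and the peak is Gaussian-shaped with width sigma_deg.
    """
    return [
        math.exp(-((theta - true_azimuth_deg) ** 2) / (sigma_deg ** 2))
        for theta in range(num_directions)
    ]


# Encode a hypothetical target speaker at 60 degrees, then decode by
# picking the direction with the highest coded value.
target = likelihood_coding(60)
estimate = max(range(M), key=lambda i: target[i])
```

Under these assumptions the coded value is 1 at the true direction and decays smoothly on either side, so a network output can be decoded by peak picking.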

L-SpEx: Localized Target Speaker Extraction

Meng, Xu, Wang et al. 2022

Preprint

Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., face image or video. In this paper, we propose an end-to-end localized target speaker extraction on pure speech cues, called L-SpEx. Specifically, we design a speaker localizer driven by the target speaker's embedding to extract the spatial features, including direction-of-arrival (DOA) of the target speaker and beamforming output. Then, the spatial cues and target speaker's embedding are both used to form a top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset called MC-Libri2Mix show that our L-SpEx approach significantly outperforms the baseline system.

“…In the SSL literature, a great proportion of systems focuses on localizing speech sources, because of its importance in related tasks such as speech enhancement or speech recognition. Examples of speaker localization systems can be found in [39], [40], [41], [42]. In such systems, the neural networks are trained to estimate the DoA of speech sources so that they are somehow specialized in this type of source.…”

Section: B Source Types (mentioning)

confidence: 99%

“…Several systems consider only the magnitude spectrograms, such as [52], [140], [199], [204], while others consider only the phase spectrogram [128], [203]. When considering both magnitude and phase, they can be stacked also in a third dimension (as well as channels). This representation has been employed in many neural-based SSL systems [41], [70], [131], [143], [147], [148], [152], [153], [187]. Other systems proposed to decompose the complex-valued spectrograms into real and imaginary parts [42], [119], [192], [205].…”

Section: Spectrogram-based Features (mentioning)

confidence: 99%
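
The two input representations named in the statement above can be illustrated with a small sketch. It assumes a complex spectrogram indexed as [channel][frame][bin] and produces two real-valued planes per microphone channel, either magnitude and phase or real and imaginary parts. The function names and the toy values are hypothetical and not taken from any of the cited systems.

```python
import cmath


def stack_planes(spectrogram, first, second):
    """Build real-valued feature planes from a complex spectrogram.

    spectrogram is indexed [channel][frame][bin]; the result is indexed
    [plane][frame][bin], with two planes per input channel.
    """
    planes = []
    for channel in spectrogram:
        planes.append([[first(x) for x in frame] for frame in channel])
        planes.append([[second(x) for x in frame] for frame in channel])
    return planes


def magnitude_phase(spectrogram):
    """Stack magnitude and phase planes for every channel."""
    return stack_planes(spectrogram, abs, cmath.phase)


def real_imag(spectrogram):
    """Stack real and imaginary planes for every channel."""
    return stack_planes(spectrogram, lambda x: x.real, lambda x: x.imag)


# Toy input: 2 microphones, 1 frame, 2 frequency bins (values are made up).
toy = [[[1 + 0j, 0 + 2j]], [[-3 + 0j, 1 + 1j]]]
mp = magnitude_phase(toy)
ri = real_imag(toy)
```

Both variants keep the same information as the complex spectrogram. They differ in how it is presented to the network: phase wraps around at ±π, whereas real and imaginary parts vary continuously.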

“…Finally, we also found a series of works in which the neural networks are tested using real data specifically recorded for the presented work in the researchers' own laboratories, e.g., [1], [40], [41], [45], [63], [66], [105], [129], [192], [212].…”

Section: B Real Data (mentioning)

confidence: 99%

“…When dealing with multiple sources, still with the classification paradigm, sigmoid activation functions and a binary cross-entropy loss function are used, see e.g., [60], [63], [118]. With a regression scheme, the choice for the cost function has been the mean square error in most systems [41], [123], [131], [135], [142], [177], [204]. We also sometimes witness the use of other cost functions, such as the angular error [212] and the ℓ1-norm [57].…”

Section: A Supervised Learning (mentioning)

confidence: 99%
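
The cost functions listed in the statement above can be sketched in a few lines. This is an illustrative sketch only: the binary cross-entropy treats each direction class as an independent sigmoid output so that several sources can be active at once, and the angular error is assumed here to be the wrapped difference between two azimuths in degrees, which may differ from the definition used in [212]. All function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(logits, targets):
    """Mean BCE over direction classes; several targets may be 1 at once."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(targets)


def mean_square_error(predicted, target):
    """Mean squared error, the usual choice for regression outputs."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def angular_error_deg(predicted_deg, target_deg):
    """Smallest absolute difference between two azimuths, in [0, 180]."""
    diff = abs(predicted_deg - target_deg) % 360.0
    return min(diff, 360.0 - diff)
```

For example, `angular_error_deg(350, 10)` evaluates to 20.0 rather than 340.0, which is why a wrapped error can be preferable to a plain difference when azimuths span the full circle.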

A Survey of Sound Source Localization with Deep Learning Methods

Grumiaux, Kitić, Girin et al. 2021

Preprint

This article is a survey on deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environment, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of the deep learning-based sound source localization methods. Tables summarizing the literature survey are provided at the end of the paper for a quick search of methods with a given set of target characteristics.

Speaker identification and localization using shuffled MFCC features and deep learning

Barhoush, Hallawa, Schmeink 2023

Int J Speech Technol

The use of machine learning in automatic speaker identification and localization systems has recently seen significant advances. However, this progress comes at the cost of using complex models, computations, and increasing the number of microphone arrays and training data. Therefore, in this work, we propose a new end-to-end identification and localization model based on a simple fully connected deep neural network (FC-DNN) and just two input microphones. This model can jointly or separately localize and identify an active speaker with high accuracy in single and multi-speaker scenarios by exploiting a new data augmentation approach. In this regard, we propose using a novel Mel Frequency Cepstral Coefficients (MFCC) based feature called Shuffled MFCC (SHMFCC) and its variant Difference Shuffled MFCC (DSHMFCC). In order to test our approach, we analyzed the performance of the identification and localization proposed model on the new features at different noise and reverberation conditions for single and multi-speaker scenarios. The results show that our approach achieves high accuracy in these scenarios, outperforms the baseline and conventional methods, and achieves robustness even with small-sized training data.
