Localizing speakers in multiple rooms by using Deep Neural Networks

Vesperini, Fabio; Vecchiotti, Paolo; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco

doi:10.1016/j.csl.2017.12.002

Cited by 29 publications

(21 citation statements)

References 26 publications


“…The baseline system is a state-of-the-art DNN-based localisation system using GCC-PHAT features as inputs [6,22]. GCC-PHAT features are computed as the inverse transform of the frequency domain cross-correlation of two audio signals captured by a microphone pair.…”

Section: Baseline System (mentioning)

confidence: 99%
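The GCC-PHAT computation described in the statement above can be sketched as follows. This is a minimal illustration using NumPy, not the implementation of any cited system: the function names `gcc_phat` and `estimate_delay`, the `max_lag` parameter, and the small constant added before division are assumptions made for the example. The phase transform (PHAT) step normalises the cross-spectrum to unit magnitude before the inverse transform.

```python
import numpy as np

def gcc_phat(x, y, max_lag):
    """Return the GCC-PHAT sequence for lags -max_lag..+max_lag (hypothetical helper).

    Assumes 1 <= max_lag < (len(x) + len(y)) / 2.
    """
    n = len(x) + len(y)                 # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)              # frequency-domain cross-correlation
    cross /= np.abs(cross) + 1e-12      # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)       # inverse transform back to the lag domain
    # Negative lags sit at the end of the array; move them to the front.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))

def estimate_delay(x, y, max_lag):
    """Delay of x relative to y, in samples (positive when x lags y)."""
    cc = gcc_phat(x, y, max_lag)
    return int(np.argmax(cc)) - max_lag

# Usage: a noise signal and a copy delayed by 5 samples.
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
delayed = np.concatenate((np.zeros(5), s[:-5]))
lag = estimate_delay(delayed, s, max_lag=20)
```

In a DNN-based localiser of the kind quoted above, the windowed GCC-PHAT sequence itself (rather than only its peak position) would typically serve as the input feature vector for each microphone pair.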

“…In [5], probabilistic neural networks were used to estimate the direction of arrival (DOA) in an indoor environment using GCC-based features. A similar scenario was studied in [6] which used a convolutional neural network (CNN) to predict speaker coordinates. Binaural cues are employed in [7], where the cross-correlation function (CCF) was used as features in a DNN to estimate the azimuth of a sound source with simulated head movement.…”

Section: Introduction (mentioning)

confidence: 99%


End-to-end Binaural Sound Localisation from the Raw Waveform

Vecchiotti, Squartini, et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


A novel end-to-end binaural sound localisation approach is proposed which estimates the azimuth of a sound source directly from the waveform. Instead of the hand-crafted features commonly employed for binaural sound localisation, such as the interaural time and level difference, our end-to-end approach uses a convolutional neural network (CNN) to extract specific features from the waveform that are suitable for localisation. Two systems are proposed which differ in the initial frequency analysis stage. The first system is auditory-inspired and makes use of a gammatone filtering layer, while the second system is fully data-driven and exploits a trainable convolutional layer to perform frequency analysis. In both systems, a set of dedicated convolutional kernels is then employed to search for specific localisation cues, which are coupled with a localisation stage using fully connected layers. Localisation experiments using binaural simulation in both anechoic and reverberant environments show that the proposed systems outperform a state-of-the-art deep neural network system. Furthermore, our investigation of the frequency analysis stage in the second system suggests that the CNN is able to exploit different frequency bands for localisation according to the characteristics of the reverberant environment.


“…The latter is extended to a multi-channel 3D-CNN system in [31], where log-Mel filterbank energies (40-dimensional) are employed as features, temporal context is exploited by concatenating adjacent time frames, and the resulting 2D single-microphone feature matrices are stacked across channels. Finally, in [32], the aforementioned 3D-CNN is combined with the GCC-PHAT [70] based CNN of [71] to yield a joint SAD and speaker localization network.…”

Section: Related Work (mentioning)

confidence: 99%
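The feature arrangement described in the statement above (per-microphone log-Mel frames, concatenated with adjacent frames for temporal context, then stacked across channels) can be sketched as follows. This is only an illustration under stated assumptions: the function name `stack_context`, the edge-padding choice, and the output axis order are hypothetical and are not taken from the cited systems.

```python
import numpy as np

def stack_context(feats, context):
    """Build one multi-channel context window per frame (hypothetical helper).

    feats:   array of shape (n_channels, n_frames, n_mels),
             e.g. n_mels = 40 log-Mel filterbank energies per frame.
    context: number of adjacent frames taken on each side.
    Returns an array of shape (n_frames, n_channels, 2*context + 1, n_mels).
    """
    n_channels, n_frames, n_mels = feats.shape
    # Repeat the first and last frame so every frame has a full window
    # (an assumption; other padding schemes are equally plausible).
    padded = np.pad(feats, ((0, 0), (context, context), (0, 0)), mode="edge")
    windows = [padded[:, t:t + 2 * context + 1, :] for t in range(n_frames)]
    return np.stack(windows, axis=0)

# Usage: 2 microphones, 6 frames, 40 Mel bands, 2 frames of context per side.
feats = np.arange(2 * 6 * 40, dtype=float).reshape(2, 6, 40)
windows = stack_context(feats, context=2)
```

Each element of `windows` is a stack of 2D time-frequency matrices, one per microphone, which is the kind of input a 3D convolutional layer can consume.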

Room-localized speech activity detection in multi-microphone smart homes

Giannoulis, Potamianos, Maragos 2019

J AUDIO SPEECH MUSIC PROC.


Voice-enabled interaction systems in domestic environments have attracted significant interest recently, being the focus of smart home research projects and commercial voice assistant home devices. Within the multi-module pipelines of such systems, speech activity detection (SAD) constitutes a crucial component, providing input to their activation and speech recognition subsystems. In typical multi-room domestic environments, SAD may also convey spatial intelligence to the interaction, in addition to its traditional temporal segmentation output, by assigning speech activity at the room level. Such room-localized SAD can, for example, disambiguate user command referents, allow localized system feedback, and enable parallel voice interaction sessions by multiple subjects in different rooms. In this paper, we investigate a room-localized SAD system for smart homes equipped with multiple microphones distributed in multiple rooms, significantly extending our earlier work. The system employs a two-stage algorithm, incorporating a set of hand-crafted features specially designed to discriminate room-inside vs. room-outside speech at its second stage, refining SAD hypotheses obtained at its first stage by traditional statistical modeling and acoustic front-end processing. Both algorithmic stages exploit multi-microphone information, combining it at the signal, feature, or decision level. The proposed approach is extensively evaluated on both simulated and real data recorded in a multi-room, multi-microphone smart home, significantly outperforming alternative baselines. Further, it remains robust to reduced microphone setups, while also comparing favorably to deep learning-based alternatives.


“…With the advent and huge increase of applications of deep neural networks in all areas of machine learning, promising works have also been proposed for ASL [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 57, 58, 59, 60, 61]. This is mainly due to the sophisticated capabilities and more careful implementation details of network architectures and the availability of advanced hardware architectures with increased computational capacity.…”

Section: State of the Art (mentioning)

confidence: 99%

“…The idea of using neural networks for sound processing is not new and has gained popularity in recent years (especially for speech recognition [4]). In the context of ASL, deep learning methods have been recently developed [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. Most of these works focus on obtaining the Direction of Arrival (DOA) of the acoustic source.…”

Section: Introduction (mentioning)

confidence: 99%

Towards End-to-End Acoustic Localization Using Deep Learning: From Audio Signals to Source Position Coordinates

Vera-Diaz, Pizarro, Macías-Guarasa 2018

Sensors


This paper presents a novel approach for indoor acoustic source localization using microphone arrays, based on a Convolutional Neural Network (CNN). In the proposed solution, the CNN is designed to directly estimate the three-dimensional position of a single acoustic source using the raw audio signal as the input information and avoiding the use of hand-crafted audio features. Given the limited amount of available localization data, we propose, in this paper, a training strategy based on two steps. We first train our network using semi-synthetic data generated from close-talk speech recordings. We simulate the time delays and distortion suffered by the signal as it propagates from the source to the array of microphones. We then fine-tune this network using a small amount of real data. Our experimental results, evaluated on a publicly available dataset recorded in a real room, show that this approach is able to produce networks that significantly improve on existing localization methods based on SRP-PHAT strategies and also on those presented in very recent proposals based on Convolutional Recurrent Neural Networks (CRNN). In addition, our experiments show that the performance of our CNN method does not show a relevant dependency on the speaker’s gender, nor on the size of the signal window being used.
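The semi-synthetic data idea in the abstract above (taking a close-talk recording and simulating the delay each microphone would observe) can be illustrated with a simple free-field sketch. Everything here is an assumption made for the example and not the authors' procedure: the function name `simulate_array`, integer-sample delays, 1/r attenuation as the only distortion, and a speed of sound of 343 m/s.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def simulate_array(signal, source_pos, mic_positions, fs):
    """Delay and attenuate a dry signal for each microphone (hypothetical helper).

    signal:        1-D close-talk recording
    source_pos:    (3,) source coordinates in metres
    mic_positions: (n_mics, 3) microphone coordinates in metres
    fs:            sampling rate in Hz
    Returns (multi-channel signal, per-microphone delays in samples).
    """
    signal = np.asarray(signal, dtype=float)
    source_pos = np.asarray(source_pos, dtype=float)
    mic_positions = np.asarray(mic_positions, dtype=float)
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    # Propagation time rounded to whole samples (a simplification).
    delays = np.round(dists / SPEED_OF_SOUND * fs).astype(int)
    out = np.zeros((len(mic_positions), len(signal) + delays.max()))
    for m, (d, r) in enumerate(zip(delays, dists)):
        # Free-field 1/r attenuation; no reverberation or noise is modelled.
        out[m, d:d + len(signal)] = signal / max(r, 1e-3)
    return out, delays

# Usage: an impulse heard by two microphones 3.43 m and 6.86 m from the source.
mics = [[0.0, 0.0, 0.0], [3.43, 0.0, 0.0]]
multi, delays = simulate_array([1.0, 0.0, 0.0], [-3.43, 0.0, 0.0], mics, fs=16000)
```

The known source position used to generate each example then serves directly as the training target for a position-regression network.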
