Space-time voice activity detection

2017 Hands-Free Speech Communications and Microphone Arrays (HSCMA)

2017

This paper addresses the problem of Target Activity Detection (TAD) for binaural listening devices. TAD denotes the problem of robustly detecting the activity of a target speaker in a harsh acoustic environment, which comprises interfering speakers and noise ('cocktail party scenario'). In previous work, it has been shown that employing a Feed-forward Neural Network (FNN) for detecting the target speaker activity is a promising approach to combine the advantage of different TAD features (used as network inputs). In this contribution, we exploit a larger context window for TAD and compare the performance of FNNs and Recurrent Neural Networks (RNNs) with an explicit focus on small network topologies as desirable for embedded acoustic signal processing systems. More specifically, the investigations include a comparison between three different types of RNNs, namely plain RNNs, Long Short-Term Memories, and Gated Recurrent Units. The results indicate that all versions of RNNs outperform FNNs for the task of TAD.

Section: Introduction (mentioning)

confidence: 99%

Efficient target activity detection based on recurrent neural networks

Gerber

2017 Hands-Free Speech Communications and Microphone Arrays (HSCMA)

2017

“…In order to be able to exploit spatial information, multi-microphone recordings are required. Conventional methods for acoustic source localization can be modified to allow a discrimination between multiple point sources [5] or between background noise (assumed to be incoherent) and point sources [6]. Similarly, the position of the null of an adaptive null-steering beamformer can be tracked, indicating a dominant target source if the null is steered towards the target source position [7].…”

Section: Introduction (mentioning)

confidence: 99%

Analysis of the robustness of neural network-based target activity detection

Gerber

2017 25th European Signal Processing Conference (EUSIPCO)

2017

Many applications in audio signal processing require a precise identification of time frames where a predefined target source is active. In previous work, Artificial Neural Networks (ANNs) with cross-correlation features showed a considerable potential in this field. In this paper, the performance of ANN-based target activity detection is analyzed in more detail and compared with a well-performing "classical" signal processing method. On the one hand, the impact of the angular distance between target source and interferers is evaluated for both the neural network-based method and the classical one. On the other hand, the sensitivity of both methods to varying Signal-to-Noise Ratio (SNR) conditions is analyzed with respect to the importance of a proper choice of detection thresholds. In the evaluations, the ANN-based method proves its general superiority and also its robustness with respect to a non-ideal choice of detection thresholds.

“…Conventional acoustic source localization techniques for multi-microphone arrays can be modified to provide information on target source activity. For instance, the Steered Response Power (SRP) method can be exploited to either distinguish between multiple point sources [4] or between point sources and incoherent background noise [5]. Similarly, the cross-correlation function between two microphones can be calculated, allowing for a detection of target activity when a peak is observed for the time lag corresponding to a target source position [6,7,8].…”

Section: Introduction (mentioning)

confidence: 99%
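
The cross-correlation approach mentioned in the citation statement above (detecting target activity when the cross-correlation function between two microphones peaks at the time lag of the target position) can be illustrated with a minimal sketch. This is not the implementation used in any of the cited papers. It assumes integer sample lags, a single frame of two time-aligned microphone signals, and a hypothetical detection threshold of 0.5; the function names `target_lag_score` and `detect_target` are made up for illustration.

```python
import numpy as np


def target_lag_score(x_left, x_right, target_lag, max_lag):
    """Normalized cross-correlation of one frame at integer lags.

    Returns (value of the normalized cross-correlation at target_lag,
    lag in [-max_lag, max_lag] at which the cross-correlation is largest).
    A positive lag means x_right is a delayed copy of x_left.
    """
    if abs(target_lag) > max_lag:
        raise ValueError("target_lag must lie within [-max_lag, max_lag]")
    if len(x_left) != len(x_right) or max_lag >= len(x_left):
        raise ValueError("frames must have equal length greater than max_lag")

    # Remove the mean so that a DC offset does not dominate the correlation.
    x_left = np.asarray(x_left, dtype=float) - np.mean(x_left)
    x_right = np.asarray(x_right, dtype=float) - np.mean(x_right)

    # Normalize by the frame energies; the small constant avoids division by zero.
    norm = np.sqrt(np.sum(x_left ** 2) * np.sum(x_right ** 2)) + 1e-12
    full = np.correlate(x_right, x_left, mode="full") / norm

    zero = len(x_left) - 1  # index of lag 0 in the "full" output
    ccf = full[zero - max_lag: zero + max_lag + 1]
    lags = np.arange(-max_lag, max_lag + 1)

    peak_lag = int(lags[np.argmax(ccf)])
    return float(ccf[target_lag + max_lag]), peak_lag


def detect_target(x_left, x_right, target_lag, max_lag, threshold=0.5):
    """Declare the target active if the peak sits at the target lag and is
    strong enough. The threshold value is an illustrative assumption."""
    score, peak_lag = target_lag_score(x_left, x_right, target_lag, max_lag)
    return peak_lag == target_lag and score >= threshold
```

A frame in which the right channel is the left channel delayed by the target lag yields a detection, while a source arriving with a different lag does not. As the abstract above points out, the choice of a fixed detection threshold is sensitive to SNR conditions, which is part of the motivation for the neural network-based methods.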

Artificial Neural Network-Based Feature Combination for Spatial Voice Activity Detection

2016

Interspeech 2016

For many applications in speech communications and speech-based human-machine interaction, a reliable Voice Activity Detection (VAD) is crucial. Conventional methods for VAD typically differentiate between a target speaker and background noise by exploiting characteristic properties of speech signals. If a target speaker should be distinguished from other speech sources, these conventional concepts are no longer applicable, and other methods, typically exploiting the spatial diversity of the individual sources, are required. Often, it is beneficial to combine several features in order to improve the overall decision. Optimum combinations of features, however, depend strongly on the scenario, especially on the position of the target source, the characteristics of noise and interference and the Signal-to-Interference Ratio (SIR). Moreover, choosing detection thresholds which are robust to changing scenarios is often a difficult problem. In this paper, these issues are addressed by introducing Artificial Neural Networks (ANNs) for spatial voice activity detection, which allow to combine several features with background information. The experimental results show that already small ANNs can significantly and robustly improve the detection rates, offering a valuable tool for VAD.