Microphone arrays use spatial diversity for separating concurrent audio sources. Source signals from different directions of arrival (DOAs) are captured with DOA-dependent time delays between the microphones. These delays can be exploited in the short-time Fourier transform domain to yield time-frequency masks that extract a target signal while suppressing unwanted components. Using deep neural networks (DNNs) for mask estimation has drastically improved separation performance. However, separation of closely spaced sources remains difficult due to their similar inter-microphone time delays. We propose using auxiliary information on the source DOAs within the DNN to improve the separation. This information can be encoded as the expected phase differences between the microphones. Alternatively, the DNN can learn a suitable input representation on its own when provided with a multi-hot encoding of the DOAs. Experimental results demonstrate the benefit of this information for separating closely spaced sources.
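As an illustration of how such DOA side information could be presented to a mask-estimation DNN, the sketch below computes the expected inter-microphone phase differences for a hypothesised far-field DOA, as well as a coarse multi-hot DOA encoding. The function names, the 2-D array geometry, and all parameter values are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def expected_phase_differences(doa_deg, mic_positions, n_fft=512, fs=16000, c=343.0):
    """Expected inter-microphone phase differences for a hypothesised DOA.

    doa_deg: direction of arrival in degrees (far-field, 2-D assumption).
    mic_positions: (M, 2) array of microphone coordinates in metres.
    Returns an (M-1, n_fft // 2 + 1) array of phase differences w.r.t. mic 0.
    """
    doa = np.deg2rad(doa_deg)
    unit = np.array([np.cos(doa), np.sin(doa)])        # propagation direction
    delays = mic_positions @ unit / c                   # per-microphone delay in seconds
    rel_delays = delays[1:] - delays[0]                 # delays relative to reference mic
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)          # STFT bin frequencies
    return -2.0 * np.pi * rel_delays[:, None] * freqs[None, :]  # phase = -2*pi*f*tau

def multi_hot_doa(doas_deg, resolution_deg=5):
    """Multi-hot encoding of all active source DOAs on a coarse angular grid."""
    grid = np.zeros(360 // resolution_deg)
    for doa in doas_deg:
        grid[int(round(doa / resolution_deg)) % len(grid)] = 1.0
    return grid
```

Either representation can then be concatenated with the spectral input features of the DNN, letting the network exploit the DOA prior when the inter-microphone time delays alone are too similar to discriminate the sources.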
We propose a multistage approach for enhancing speech captured by a drone-mounted microphone array. The key challenge is suppressing the drone ego-noise, which is the major source of interference in such captures. Since the location of the target is not known a priori, we first apply a U-Net-based deep convolutional autoencoder (AE) individually to each microphone signal. The AE generates a time-frequency mask with values in [0, 1] per signal, where high values correspond to time-frequency points with relatively good signal-to-noise ratios (SNRs). The masks are pooled across all microphones, and the aggregated mask is used to steer an adaptive, frequency-domain beamformer, yielding a signal with an improved SNR. Feeding this beamformer output back to the AE yields an improved mask, which is then used to refocus the beamformer. This combination of AE and beamformer, which can be applied to the signals in multiple 'passes', is termed multistage beamforming. The approach is developed and evaluated on a self-collected database. For the AE, when used to steer a beamformer, a training target that preserves more speech at the cost of less noise suppression outperforms an aggressive training target that suppresses more noise at the cost of more speech distortion. This, in combination with max-pooling of the multi-channel mask, which also lets through more speech (and noise) compared with median pooling, performs best. The experiments further demonstrate that the multistage approach brings extra benefit to speech quality and intelligibility when the input SNR is ≥ −10 dB, and yields comprehensible outputs when the input SNR is above −5 dB.
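A minimal sketch of one such pass is given below, with the per-microphone AE masks taken as given. The abstract specifies only an adaptive, frequency-domain beamformer; the mask-based MVDR formulation, the covariance estimation, and the function names are therefore illustrative assumptions rather than the paper's exact method.

```python
import numpy as np

def pool_masks(masks, method="max"):
    """Aggregate per-microphone masks of shape (M, F, T) into one (F, T) mask."""
    return np.max(masks, axis=0) if method == "max" else np.median(masks, axis=0)

def beamform_pass(stft, mask, ref_mic=0, eps=1e-6):
    """One beamforming pass steered by a pooled time-frequency mask.

    stft: (M, F, T) complex multi-channel STFT; mask: (F, T) with values in [0, 1].
    Returns the (F, T) single-channel beamformer output.
    """
    M, F, T = stft.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = stft[:, f, :]                                    # (M, T) for this frequency bin
        w_s, w_n = mask[f], 1.0 - mask[f]                    # speech / noise weights
        phi_s = (w_s * X) @ X.conj().T / (w_s.sum() + eps)   # masked speech covariance
        phi_n = (w_n * X) @ X.conj().T / (w_n.sum() + eps)   # masked noise covariance
        phi_n += eps * np.eye(M)                             # diagonal loading
        steer = phi_s[:, ref_mic]                            # steering vector estimate
        num = np.linalg.solve(phi_n, steer)
        w = num / (steer.conj() @ num + eps)                 # MVDR-style weights
        out[f] = w.conj() @ X
    return out
```

Feeding the enhanced output back through the AE to obtain a refined mask, and then repeating the pooling and beamforming, corresponds to the additional 'passes' of the multistage scheme.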
The field of speaker detection is relatively well researched. Multiple solutions exist that focus solely on audio, solely on video, or on a combination of both. On the audio side, a popular feature representation is the set of mel-frequency cepstral coefficients, a sparse representation of the audio signal. On the video side, mostly raw pixel intensities are used, which are not sparse at all. In this paper, we examine a sparse video feature representation, namely facial landmarks. We first evaluate which selection of landmarks conveys the most information. Afterwards, we propose several neural network architectures trained for audio-visual speaker detection. We compare the original architecture and the architectures utilizing facial landmarks in terms of both computational performance and accuracy. For the evaluation, we introduce a new dataset to better understand the differences between the pixel and landmark features. The landmark features achieve similar accuracies for a forward-oriented head position, with a small reduction in performance for non-ideal head positions and in the case of occlusions. There is, however, a significant computational benefit: the decreased feature dimensionality reduces the complexity by two to three orders of magnitude, which is a major advantage for embedded devices. We hope this provides insight into, and raises interest in, this novel type of active speaker detection model.
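To illustrate why the landmark representation is computationally so much cheaper than pixel intensities, the toy PyTorch classifier below operates on a short window of flattened (x, y) landmark coordinates, i.e. 136 values per frame for 68 landmarks versus tens of thousands of pixels for a cropped face. The architecture, layer sizes, and window length are hypothetical and do not correspond to the networks proposed in the paper.

```python
import torch
import torch.nn as nn

class LandmarkSpeakerNet(nn.Module):
    """Toy active-speaker classifier over a short window of facial landmarks."""

    def __init__(self, n_landmarks=68, frames=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                  # (B, frames * n_landmarks * 2)
            nn.Linear(frames * n_landmarks * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                          # speaking / not-speaking logit
        )

    def forward(self, landmarks):
        # landmarks: (batch, frames, n_landmarks * 2) flattened (x, y) coordinates
        return self.net(landmarks)

# Example: a batch of 8 windows, 5 frames each, 68 landmarks per frame
logits = LandmarkSpeakerNet()(torch.randn(8, 5, 68 * 2))
```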
For separating sources captured by ad hoc distributed microphones, a key first step is assigning the microphones to the appropriate source-dominated clusters. The features used for such (blind) clustering are based on a fixed-length embedding of the audio signals in a high-dimensional latent space. In previous work, the embedding was hand-engineered from the mel-frequency cepstral coefficients and their modulation spectra. This paper argues that embedding frameworks designed explicitly to reliably discriminate between speakers would produce more appropriate features. We propose using features generated by the state-of-the-art ECAPA-TDNN speaker verification model for the clustering. We benchmark these features in terms of the subsequent signal enhancement as well as the quality of the clustering itself, for which we further introduce three intuitive metrics. Results indicate that, in contrast to the hand-engineered features, the ECAPA-TDNN-based features lead to more logical clusters and better performance in the subsequent enhancement stages, thus validating our hypothesis.
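A minimal sketch of the microphone clustering step is shown below, assuming the publicly available SpeechBrain ECAPA-TDNN checkpoint and k-means as the clustering algorithm; neither the toolkit, the checkpoint, nor the specific clustering method is taken from the abstract, so these are illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans
from speechbrain.pretrained import EncoderClassifier  # pretrained ECAPA-TDNN

def cluster_microphones(mic_signals, n_sources):
    """Assign ad hoc microphones to source-dominated clusters via ECAPA-TDNN embeddings.

    mic_signals: list of 1-D torch tensors, one per microphone (assumed 16 kHz).
    Returns one cluster label per microphone.
    """
    encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
    embeddings = []
    for sig in mic_signals:
        emb = encoder.encode_batch(sig.unsqueeze(0))       # (1, 1, embedding_dim)
        embeddings.append(emb.squeeze().detach().numpy())
    return KMeans(n_clusters=n_sources, n_init=10).fit_predict(embeddings)
```

Each microphone is thus represented by a single fixed-length speaker embedding, and microphones dominated by the same speaker should end up in the same cluster before the subsequent enhancement stage.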