Yanjue Song scite author profile

IEEE/ACM Trans. Audio Speech Lang. Process.

2022

The periodic nature of voiced speech is often exploited to restore speech harmonics and to increase inter-harmonic noise suppression. In particular, a recent paper proposed to do this by manipulating the speech harmonic frequencies in the cepstral domain. The manipulations were carried out on the cepstrum of the excitation signal, obtained by the source-filter decomposition of speech. This method was termed Cepstral Excitation Manipulation (CEM). In this contribution we further analyse this method, point out its inherent weakness and propose means to overcome it. First of all, it will be shown by both illustrative examples and theoretical analysis that the existing method underestimates the excitation, especially at low signal-to-noise ratio (SNR) conditions. This inherent weakness leads to speech harmonic weakening and vocoding due to the insufficient noise suppression in the inter-harmonic regions. Then, we propose two modifications to improve the robustness and performance of CEM in low SNR cases. The first modification is to use an instantaneous amplifying factor adapted to the signal, instead of a pre-defined constant, for the excitation cepstrum. The second modification is to smooth the excitation cepstrum to preserve additional fine structure, instead of discarding it. These modifications result in better preservation of speech harmonics, more refined fine structure and higher inter-harmonic noise suppression. Experimental evaluations using a range of standard instrumental metrics conclusively demonstrate that our proposed modifications clearly outperform the existing method, especially in extremely noisy conditions.
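As a rough illustration of the cepstral-domain view described in this abstract, the sketch below splits the real cepstrum of one frame into a low-quefrency (envelope) part and a high-quefrency (excitation) part by liftering, and then locates the excitation peak within a plausible pitch range. It is not the CEM algorithm itself: the lifter cut-off `n_env`, the pitch range, the Hann window and the function names are assumptions made for illustration, and the paper may obtain the excitation through a different source-filter decomposition.

```python
import numpy as np

def cepstral_split(frame, n_env=20):
    """Split the real cepstrum of one frame into envelope and excitation parts.

    n_env is a hypothetical lifter cut-off in quefrency bins (an assumption,
    not a value taken from the paper).
    """
    n = len(frame)
    spectrum = np.fft.fft(frame * np.hanning(n))
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
    cepstrum = np.fft.ifft(log_mag).real        # real cepstrum, symmetric in quefrency

    # Low quefrencies (and their mirror image) describe the spectral envelope.
    lifter = np.zeros(n)
    lifter[:n_env] = 1.0
    lifter[n - n_env + 1:] = 1.0

    envelope_cep = cepstrum * lifter
    excitation_cep = cepstrum - envelope_cep    # remaining fine structure
    return envelope_cep, excitation_cep

def pitch_quefrency(excitation_cep, fs, f_min=60.0, f_max=400.0):
    """Return the quefrency bin of the largest excitation peak in [f_min, f_max]."""
    q_lo = int(fs / f_max)
    q_hi = int(fs / f_min)
    return q_lo + int(np.argmax(excitation_cep[q_lo:q_hi]))
```

For a voiced frame sampled at `fs`, the bin returned by `pitch_quefrency` corresponds to a fundamental frequency of roughly `fs / bin`. The cepstral coefficient at that bin (and its multiples) is the kind of quantity the abstract refers to when it speaks of amplifying or smoothing the excitation cepstrum.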

Aiding Speech Harmonic Recovery in DNN-Based Single Channel Noise Reduction Using Cepstral Excitation Manipulation (CEM) Components

2023

Weak harmonics of voiced speech segments are often lost during the process of noise suppression, especially at low SNRs. This leads to a distortion in the harmonic structure, and an accompanying loss in quality. In this paper, inspired by previous work on speech harmonic enhancement using statistical methods, we present a loss function component we term cepstral excitation manipulation (CEM) loss, which is constructed based on the fundamental frequency-related cepstral coefficients. This component can be introduced to the training of state-of-the-art architectures and its benefit is benchmarked, here, on CRUSE. Experiments show that the proposed loss function component nicely supplements standard loss functions and the harmonic structure is better preserved. On average, the best system improves by 0.4 on PESQ and 0.47 on DNSMOS compared to the noisy input. Substantial improvements are primarily in low SNRs (-5 dB to 5 dB), the range for which harmonic recovery is most required.

Investigations on the Optimal Estimation of Speech Envelopes for the Two-Stage Speech Enhancement

2023

Sensors

Using the source-filter model of speech production, clean speech signals can be decomposed into an excitation component and an envelope component that is related to the phoneme being uttered. Therefore, restoring the envelope of degraded speech during speech enhancement can improve the intelligibility and quality of output. As the number of phonemes in spoken speech is limited, they can be adequately represented by a correspondingly limited number of envelopes. This can be exploited to improve the estimation of speech envelopes from a degraded signal in a data-driven manner. The improved envelopes are then used in a second stage to refine the final speech estimate. Envelopes are typically derived from the linear prediction coefficients (LPCs) or from the cepstral coefficients (CCs). The improved envelope is obtained either by mapping the degraded envelope onto pre-trained codebooks (classification approach) or by directly estimating it from the degraded envelope (regression approach). In this work, we first investigate the optimal features for envelope representation and codebook generation by a series of oracle tests. We demonstrate that CCs provide better envelope representation compared to using the LPCs. Further, we demonstrate that a unified speech codebook is advantageous compared to the typical codebook that manually splits speech and silence as separate entries. Next, we investigate low-complexity neural network architectures to map degraded envelopes to the optimal codebook entry in practical systems. We confirm that simple recurrent neural networks yield good performance with a low complexity and number of parameters. We also demonstrate that with a careful choice of the feature and architecture, a regression approach can further improve the performance at a lower computational cost. 
However, as also seen from the oracle tests, the benefit of the two-stage framework is now chiefly limited by the statistical noise floor estimate, leading to only a limited improvement in extremely adverse conditions. This highlights the need for further research on joint estimation of speech and noise for optimum enhancement.
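The classification approach described in this abstract maps a degraded envelope onto an entry of a pre-trained codebook. A minimal sketch of the lookup step is given below, assuming envelopes are stored as cepstral-coefficient vectors and that the closest entry is chosen by squared Euclidean distance. The distance measure, the array layout and the function name are assumptions for illustration; the paper itself uses neural networks to select the entry in practical systems.

```python
import numpy as np

def nearest_codebook_entry(envelope_cc, codebook):
    """Return (index, entry) of the codebook row closest to envelope_cc.

    envelope_cc: 1-D array of cepstral coefficients for one frame.
    codebook:    2-D array, one envelope (cepstral vector) per row.
    The squared Euclidean distance is an assumed selection criterion.
    """
    distances = np.sum((codebook - envelope_cc) ** 2, axis=1)
    idx = int(np.argmin(distances))
    return idx, codebook[idx]
```

With clean envelopes as the query this behaves like an oracle lookup, since the selected entry is the best one the codebook can offer for that frame.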

Portable and Non-Intrusive Fill-State Detection for Liquid-Freight Containers Based on Vibration Signals

Hoecke

Madhu

2022

Sensors

Remote, automated querying of fill-states of liquid-freight containers can significantly boost the operational efficiency of rail- and storage-yards. Most existing methods for fill-state detection are intrusive, or require sophisticated instrumentation and specific testing conditions, making them unsuitable here, due to the noisy and changeable surroundings and restricted access to the interior. We present a non-intrusive system that exploits the influence of the fill-state on the container’s response to an external excitation. Using a solenoid and accelerometer mounted on the exterior wall of the container, to generate pulsed excitation and to measure the container response, the fill-state can be detected. The decision can be either a binary (empty/non-empty) label or a (quantised) prediction of the liquid level. We also investigate the choice of the signal features for the detection/classification, and the placement of the sensor and actuator. Experiments conducted in real settings validate the algorithms and the prototypes. Results show that the placement of the sensor and actuator along the base of the container is the best in terms of detection accuracy. In terms of signal features, linear predictive cepstral coefficients possess sufficient discriminative information. The prediction accuracy is 100% for binary classification and exceeds 80% for quantised level prediction.
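The abstract names linear predictive cepstral coefficients as the signal feature. One standard way to obtain them is the recursion that converts linear prediction coefficients into cepstral coefficients, sketched below. The sign convention H(z) = 1 / (1 - sum_k a_k z^-k), the function name and the choice to omit the gain term are assumptions; the paper does not state which variant it uses.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients a_1..a_p into cepstral coefficients c_1..c_n.

    Assumes the all-pole model H(z) = 1 / (1 - sum_k a_k z^-k) and uses
    c_n = a_n + sum_{k=1}^{n-1} (k / n) * c_k * a_{n-k}, with a_n = 0 for n > p.
    The gain-related coefficient c_0 is not computed.
    """
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

As a sanity check, a single-pole model with a_1 = 0.5 has the closed-form cepstrum c_n = 0.5^n / n, which the recursion reproduces.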

Drone Ego-Noise Cancellation for Improved Speech Capture using Deep Convolutional Autoencoder Assisted Multistage Beamforming

Kindt

2022

We propose a multistage approach for enhancing speech captured by a drone-mounted microphone array. The key challenge is suppressing the drone ego-noise, which is the major source of interference in such captures. Since the location of the target is not known a priori, we first apply a UNet-based deep convolutional autoencoder (AE) individually to each microphone signal. The AE generates a time-frequency mask ∈ [0, 1] per signal, where high values correspond to time-frequency points with relatively good signal-to-noise ratios (SNRs). The masks are pooled across all microphones and the aggregated mask is used to steer an adaptive, frequency domain beamformer, yielding a signal with an improved SNR. This beamformer output, after being fed back to the AE, now yields an improved mask, which is used for re-focussing the beamformer. This combination of AE and beamformer, which can be applied to the signals in multiple 'passes', is termed multistage beamforming. The approach is developed and evaluated on a self-collected database. For the AE, when used to steer a beamformer, a training target that preserves more speech at the cost of less noise suppression outperforms an aggressive training target that suppresses more noise at the cost of more speech distortion. This, in combination with max-pooling of the multi-channel mask (which also lets through more speech and noise compared with median pooling), performs best. The experiments further demonstrate that the multistage approach brings extra benefit to the speech quality and intelligibility when the input SNR is ≥ −10 dB, and yields comprehensible outputs when the input has an SNR above −5 dB.
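The pooling step mentioned in this abstract can be sketched as follows, assuming the per-microphone masks are stacked in an array of shape (microphones, frequency bins, frames). The array layout and the function name are assumptions for illustration; how the pooled mask then steers the beamformer is not shown here.

```python
import numpy as np

def pool_masks(masks, mode="max"):
    """Aggregate per-microphone time-frequency masks into a single mask.

    masks: array of shape (n_mics, n_freq, n_frames) with values in [0, 1].
    mode:  "max" keeps a time-frequency point if any microphone rates it
           highly; "median" requires agreement from at least half of them.
    """
    if mode == "max":
        return masks.max(axis=0)
    if mode == "median":
        return np.median(masks, axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")
```

Max-pooling can never give a smaller value than median pooling at any time-frequency point, which is consistent with the abstract's remark that it lets through more speech and more noise.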