Improved Speech Enhancement with the Wave-U-Net

Macartney, Craig; Weyde, Tillman

doi:10.48550/arxiv.1811.11307

Cited by 37 publications

(59 citation statements)

References 0 publications

“…Following previous works of speech enhancement [24,12,25], we apply Perceptual evaluation of speech quality (PESQ) [26], Mean opinion score (MOS) predictor of signal distortion (CSIG), MOS predictor of background-noise intrusiveness (CBAK), MOS predictor of overall signal quality (COVL) [27] and segmental signal-to-noise ratio (SSNR) [28] to evaluate the speech enhancement performance. Table 1 shows that noisy speech without enhancement achieves PESQ, CSIG, CBAK, COVL, SSNR of 1.97, 3.35, 2.44, 2.63 and 1.68 dB respectively.…”

Section: Methods (mentioning)

confidence: 99%

Speech enhancement with weakly labelled data from AudioSet

Kong, Liu, et al. 2021

Preprint

Speech enhancement is a task to improve the intelligibility and perceptual quality of degraded speech signal. Recently, neural networks based methods have been applied to speech enhancement. However, many neural network based methods require noisy and clean speech pairs for training. We propose a speech enhancement framework that can be trained with large-scale weakly labelled AudioSet dataset. Weakly labelled data only contain audio tags of audio clips, but not the onset or offset times of speech. We first apply pretrained audio neural networks (PANNs) to detect anchor segments that contain speech or sound events in audio clips. Then, we randomly mix two detected anchor segments containing speech and sound events as a mixture, and build a conditional source separation network using PANNs predictions as soft conditions for speech enhancement. In inference, we input a noisy speech signal with the one-hot encoding of "Speech" as a condition to the trained system to predict enhanced speech. Our system achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system of 2.16 and 7.73 dB respectively.

“…Moreover, word error rate (WER) is also computed to assess the effects of the enhancement for speech recognition purposes. For this purpose, we use a Wav2Vec [28] architecture pre-trained on Librispeech 960h. The final metric for this task is a combination of these two measures given by (STOI + (1 − WER))/2.…”

Section: Task 1: 3D Speech Enhancement in Office Reverberant Environment (mentioning)

confidence: 99%
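
The combined metric quoted above, (STOI + (1 − WER))/2, can be sketched in a few lines. This is only an illustration and not the challenge's official scoring code: the function name `task1_metric` is hypothetical, and clipping WER to [0, 1] is an assumption made here so that the score stays in [0, 1] (WER can exceed 1 when a transcript contains many insertions).

```python
def task1_metric(stoi: float, wer: float) -> float:
    """Combine STOI and WER into one score in [0, 1]; higher is better.

    Implements (STOI + (1 - WER)) / 2 as given in the quoted statement.
    """
    if not 0.0 <= stoi <= 1.0:
        raise ValueError("stoi is expected to lie in [0, 1]")
    # Assumption (not stated in the source): clip WER to [0, 1],
    # since raw WER can be greater than 1.
    wer = min(max(wer, 0.0), 1.0)
    return (stoi + (1.0 - wer)) / 2.0


if __name__ == "__main__":
    # Example values chosen for illustration only.
    print(task1_metric(stoi=0.9, wer=0.1))
```

Averaging the two terms weights intelligibility (STOI) and recognition accuracy (1 − WER) equally, so a system cannot score well by improving only one of them.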

“…Neural beamforming techniques as Filter and Sum Networks (FaS-Net) [5] provide state-of-the art results for Ambisonics-based SE and are usually suitable for low-latency scenarios. Also U-Net-based approaches provide competitive results in this context, both for monaural [6,7] and multichannel SE tasks [8], at the expense of higher computational power demand. Other techniques to perform SE include recurrent neural networks (RNNs) [9], graph-based spectral subtraction [10], discriminative learning [11], dilated convolutions [12,13].…”

Section: Introduction (mentioning)

confidence: 99%

L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing

Guizzo, Gramaccioni, Jamili, et al. 2021

Preprint

The L3DAS21 Challenge is aimed at encouraging and fostering collaborative research on machine learning for 3D audio signal processing, with particular focus on 3D speech enhancement (SE) and 3D sound localization and detection (SELD). Alongside the challenge, we release the L3DAS21 dataset, a 65-hour 3D audio corpus, accompanied by a Python API that facilitates data usage and the results submission stage. Usually, machine learning approaches to 3D audio tasks are based on single-perspective Ambisonics recordings or on arrays of single-capsule microphones. We propose, instead, a novel multichannel audio configuration based on multiple-source and multiple-perspective Ambisonics recordings, performed with an array of two first-order Ambisonics microphones. To the best of our knowledge, this is the first time that a dual-mic Ambisonics configuration has been used for these tasks. We provide baseline models and results for both tasks, obtained with state-of-the-art architectures: FaSNet for SE and SELDnet for SELD. This report is aimed at providing all the information needed to participate in the L3DAS21 Challenge, illustrating the details of the L3DAS21 dataset, the challenge tasks and the baseline models.

“…U-Net was first introduced on image segmentation and attained several state-of-the-art results [19]. Recently, Wave-U-Net was proposed by [20] to improve audio source separation and speech enhancement [21]. However, the previous U-Net-based methods did not consider the sequence-to-sequence mechanism such as temporal dependency.…”

Section: Model Defense by U-Net Based Speech Enhancement (mentioning)

confidence: 99%

Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement

Yang, Qi, Chen, et al. 2020

Preprint

Recent studies have highlighted adversarial examples as ubiquitous threats to the deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-NetAt, to enhance adversarial speech signals. Specifically, we evaluate the model performance by interpretable speech recognition metrics and discuss the model performance by the augmented adversarial training. Our experiments show that our proposed U-NetAt improves the perceptual evaluation of speech quality (PESQ) from 1.13 to 2.78, speech transmission index (STI) from 0.65 to 0.75, and short-term objective intelligibility (STOI) from 0.83 to 0.96 on the task of speech enhancement with adversarial speech examples. We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks. We find that (i) temporal features learned by the attention network are capable of enhancing the robustness of DNN based ASR models; (ii) the generalization power of DNN based ASR model could be enhanced by applying adversarial training with an additive adversarial data augmentation. The ASR metric on word-error-rates (WERs) shows that there is an absolute 2.22% decrease under gradient-based perturbation, and an absolute 2.03% decrease under evolutionary-optimized perturbation, which suggests that our enhancement models with adversarial training can further secure a resilient ASR system.
