2020
DOI: 10.48550/arxiv.2006.05694
Preprint

HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks

Abstract: Real-world audio recordings are often degraded by factors such as noise, reverberation, and equalization distortion. This paper introduces HiFi-GAN, a deep learning method to transform recorded speech to sound as though it had been recorded in a studio. We use an end-to-end feed-forward WaveNet architecture, trained with multi-scale adversarial discriminators in both the time domain and the time-frequency domain. It relies on the deep feature matching losses of the discriminators to improve the perceptual qual…

Cited by 18 publications (21 citation statements)
References 41 publications
“…An asterisk means that the difference between ModDW and DW is statistically significant with p < 0.05. (3,8), stride size is (1,4), and the channel size is kept the same as the input. For the 2-D CNN, the settings are the same and the channel size is 64, 16, 8, 4, 1.…”
Section: Deep CNN for Conditioner Upsampling; citation type: mentioning
confidence: 99%
“…Speech enhancement (SE) of degraded speech is important across many applications including telecommunications [1], speech recognition [2], etc. Many methods have been developed for similar applications, such as speech denoising, dereverberation and equalization [3,4]. Most current SE methods are designed to remove background noise, in many cases using an additive noise model.…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…VoCo [65] Dereverb [66] HiFi-GAN [67] FFTnet [ precision@K over all test queries turns out to be 0.97 for K = 10 and 0.95 for K = 25. These high precision retrievals show that the embeddings indeed capture quality.…”
Section: Objective Evaluations; citation type: mentioning
confidence: 99%
“…We consider an exhaustive set of 10 different datasets for this evaluation. These datasets span a variety of well-known speech problems: (1) Speech Synthesis (VoCo [65] and FFTnet [68]), (2) Speech Enhancement (Dereverberation [66], Noizeus [71], HiFi-GAN [67]), (3) Voice Conversion (VCC-2018 [70]), (4) Speech Source Separation (PEASS [69]), (5) Telephony Degradations [72], (6) Bandwidth Extension (BWE [73]), and (7) General Degradations (Simulated [6]). Please refer to supplementary material for details about these datasets.…”
Section: Subjective Evaluations; citation type: mentioning
confidence: 99%