ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683130

Speech Denoising by Parametric Resynthesis

Abstract: This work proposes the use of clean speech vocoder parameters as the target for a neural network performing speech enhancement. These parameters have been designed for text-to-speech synthesis so that they both produce high-quality resyntheses and are straightforward to model with neural networks, but have not been utilized in speech enhancement until now. In comparison to a matched text-to-speech system that is given the ground truth transcripts of the noisy speech, our model is able to produce more natur…

Cited by 9 publications (11 citation statements)
References 18 publications
“…We compare these two PR-neural models with PR-World, our previously proposed model [2], where the WORLD vocoder is used and the intermediate acoustic parameters are the fundamental frequency, spectral envelope, and band aperiodicity used by WORLD [3]. Note that WORLD does not support 22 kHz sampling rates, so this system generates output at 16 kHz.…”
Section: Methods (mentioning)
confidence: 99%
“…where J is the number of coupling transformations, K is the number of convolutions, log P(z) is the log-likelihood of the spherical Gaussian with variance ν², and in training ν = 1 is used. Note that WaveGlow refers to this parameter as σ, but we use ν to avoid confusion with the logistic function in (2). We use the official published WaveGlow implementation with the original setup (12 coupling layers, each consisting of 8 layers of dilated convolution with 512 residual and 256 skip connections).…”
Section: Waveglow (mentioning)
confidence: 99%
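
The quoted passage describes log P(z) as the log-likelihood of a spherical Gaussian with variance ν². As a rough illustration only, the sketch below evaluates that single term in plain Python. It is not taken from the WaveGlow implementation: the function name is hypothetical, and a zero mean is assumed because the excerpt does not state one.

```python
import math


def spherical_gaussian_log_likelihood(z, nu=1.0):
    """Log-density of vector z under N(0, nu^2 * I).

    Hypothetical helper for illustration; the zero mean is an assumption.
    nu is the standard deviation (the excerpt uses nu = 1 in training).
    """
    if nu <= 0:
        raise ValueError("nu must be positive")
    d = len(z)  # dimensionality of z
    squared_norm = sum(x * x for x in z)
    # log N(z; 0, nu^2 I) = -||z||^2 / (2 nu^2) - (d / 2) * log(2 * pi * nu^2)
    return -squared_norm / (2.0 * nu ** 2) - 0.5 * d * math.log(2.0 * math.pi * nu ** 2)
```

With ν = 1 the second term reduces to a constant that depends only on the dimensionality of z, so only the squared norm of z varies between samples.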
“…Parametric Resynthesis (PR) systems [2,3] predict clean acoustic parameters from noisy speech and synthesize speech from these predicted parameters using a speech synthesizer or vocoder. Current speech synthesizers are trained to generate high quality speech for a single speaker.…”
Section: Introduction (mentioning)
confidence: 99%
“…Current speech synthesizers are trained to generate high quality speech for a single speaker. In previous work we showed that a single speaker PR system can synthesize very high quality clean speech at 22 kHz [2] and performs better than the corresponding TTS system [3]. Hence, a critical question is whether these systems can be generalized to unknown speakers.…”
Section: Introduction (mentioning)
confidence: 99%