2022
DOI: 10.1109/taslp.2022.3182268

Denoising-and-Dereverberation Hierarchical Neural Vocoder for Statistical Parametric Speech Synthesis

Cited by 5 publications (3 citation statements)
References 39 publications
“…In our previous work, we proposed the HiNet vocoder [37] and its variant [39]. We have also successfully applied the HiNet vocoder to the reverberation modeling task [40] and the denoising-and-dereverberation task [41], [42], respectively. As shown in Figure 1, the HiNet vocoder uses an ASP and a PSP to predict the frame-level log amplitude spectrum and phase spectrum of a waveform, respectively.…”
Section: HiNet (mentioning)
confidence: 99%
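The excerpt above describes HiNet's two-stage structure: an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP) that output frame-level log amplitude and phase spectra. The sketch below is an illustration rather than the authors' code; it shows one common way such frame-level spectra can be recombined into a waveform via per-frame inverse FFT and overlap-add, where the frame shift, FFT size, and Hann window are illustrative assumptions.

```python
# Minimal sketch, not the HiNet authors' implementation: recombine frame-level
# log amplitude and phase spectra (as produced by an ASP and a PSP) into a
# waveform. Frame shift, FFT size, and windowing are illustrative assumptions.
import numpy as np

def spectra_to_waveform(log_amplitude, phase, frame_shift=80, fft_size=512):
    # log_amplitude, phase: arrays of shape (num_frames, fft_size // 2 + 1)
    spectrum = np.exp(log_amplitude) * np.exp(1j * phase)   # complex STFT frames
    num_frames = spectrum.shape[0]
    window = np.hanning(fft_size)
    out = np.zeros(frame_shift * (num_frames - 1) + fft_size)
    norm = np.zeros_like(out)
    for t in range(num_frames):
        frame = np.fft.irfft(spectrum[t], n=fft_size)        # back to time domain
        start = t * frame_shift
        out[start:start + fft_size] += frame * window        # synthesis window
        norm[start:start + fft_size] += window ** 2          # assumes a matching analysis window
    return out / np.maximum(norm, 1e-8)
```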
“…• HiNet: The HiNet vocoder [37] we previously proposed. The model configurations were the same as those used in the Baseline-HiNet model of our previous work [42].…”
Section: B. Comparison Among Neural Vocoders (mentioning)
confidence: 99%
“…Here, the amplitude extension model was borrowed from our previous work [39] and included 2 bidirectional gated recurrent unit (GRU)-based recurrent layers, each with 1024 nodes (512 forward and 512 backward), 2 convolutional layers, each with 2048 nodes (filter width = 9), and a feedforward linear output layer with 256 nodes. A generative adversarial network (GAN) with two discriminators, which conducted convolution along the frequency and time axes respectively [39], was applied to the amplitude extension model at the training stage.…”
Section: B. Speech Generation Tasks (mentioning)
confidence: 99%
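For concreteness, the following PyTorch sketch instantiates a network with the layer sizes quoted in the excerpt: two bidirectional GRU layers of 1024 nodes (512 per direction), two convolutional layers of 2048 nodes with filter width 9, and a 256-dimensional linear output layer. It is an assumption-laden illustration rather than the cited authors' code; the input feature dimension, layer ordering, and activations are placeholders, and the GAN training with frequency- and time-axis discriminators mentioned in the excerpt is not shown.

```python
# Sketch of an amplitude extension model with the quoted layer sizes.
# Not the cited authors' code; input_dim, ordering, and ReLU are assumptions.
import torch
import torch.nn as nn

class AmplitudeExtensionModel(nn.Module):
    def __init__(self, input_dim=80):
        super().__init__()
        # Two bidirectional GRU layers: 512 hidden units per direction = 1024 nodes.
        self.gru = nn.GRU(input_dim, 512, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Two 1-D convolutional layers along the time axis, 2048 channels, width 9.
        self.conv = nn.Sequential(
            nn.Conv1d(1024, 2048, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(2048, 2048, kernel_size=9, padding=4), nn.ReLU(),
        )
        # Feedforward linear output layer producing 256 amplitude bins per frame.
        self.out = nn.Linear(2048, 256)

    def forward(self, x):                                   # x: (batch, frames, input_dim)
        h, _ = self.gru(x)                                   # (batch, frames, 1024)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)     # (batch, frames, 2048)
        return self.out(h)                                   # (batch, frames, 256)
```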