ItôWave: Itô Stochastic Differential Equation is all You Need for Wave Generation

Wu, Shoule; Shi, Ziqiang

doi:10.1109/icassp43922.2022.9746153

Cited by 6 publications

(4 citation statements)

References 6 publications

“…Pioneering work: WaveGrad [7], DiffWave [45]. Efficient vocoder: BDDM [48], InferGrad [9], WaveFit [43]. Statistical improvement: DDGM [70], PriorGrad [50], ItôWave [125], SpecGrad [44]. End-to-end, pioneering work: WaveGrad 2 [8], CRASH [90]. Efficient model: FastDiff [26]. Further improvements: DAG [79], Itôn [99]. … statistical parametric speech synthesis (SPSS) was a popular method [115,116,132,133,137] consisting of three stages. As shown in Figure 1 (a), the text input is first converted to linguistic features, then acoustic features, and to the waveform in the last stage.…”

Section: Overview Of the Text-to-speech Development (mentioning)

confidence: 99%

“…Other improvements. ItôWave [125] is the first to propose a vocoder based on linear Itô SDE. Based on mel-spectrogram, ItôWave [125] achieves higher MOS with 95% confidence than WaveGrad [7] and DiffWave [45].…”

Section: Improvement From Statistical Perspective (mentioning)

confidence: 99%

“…ItôWave [125] is the first to propose a vocoder based on linear Itô SDE. Based on mel-spectrogram, ItôWave [125] achieves higher MOS with 95% confidence than WaveGrad [7] and DiffWave [45]. SpecGrad [44] proposes to adopt the spectral envelope of diffusion noise to the conditional log-mel spectrum, which improves the sound quality especially for the high-quality bands.…”

Section: Improvement From Statistical Perspective (mentioning)

confidence: 99%

“…Model based on Itô SDE. Inspired by ItôWave [125], Itôn [99] proposes an end-to-end model for speech synthesis based on Itô SDE. Apart from the encoder-decoder architecture, Itôn [99] introduces a dual-denoiser structure for the generation of mel-spectrogram and waveform, respectively.…”

Section: End-to-end Framework (mentioning)

confidence: 99%

Audio Diffusion Model for Speech Synthesis: A Survey on Text To Speech and Speech Enhancement in Generative AI

Zhang, Zhang, Zheng et al. 2023

Preprint

Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement. This work conducts a survey on audio diffusion models, which is complementary to existing surveys that either lack the recent progress of diffusion-based speech synthesis or highlight an overall picture of applying diffusion models in multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion models. As for the text-to-speech task, we divide the methods into three categories based on the stage where the diffusion model is adopted: acoustic model, vocoder and end-to-end framework. Moreover, we categorize various speech enhancement tasks by whether certain signals are removed from or added to the input speech. Comparisons of experimental results and discussions are also covered in this survey.
