ITÔN: End-to-end audio generation with Itô stochastic differential equations

Shi, Ziqiang; Wu, Shoule

doi:10.1016/j.dsp.2022.103781

Cited by 1 publication

(4 citation statements)

References 3 publications

“…Pioneering work: WaveGrad [7], DiffWave [45]. Efficient vocoder: BDDM [48], InferGrad [9], WaveFit [43]. Statistical improvement: DDGM [70], PriorGrad [50], ItôWave [125], SpecGrad [44]. End-to-end, pioneering work: WaveGrad 2 [8], CRASH [90]. Efficient model: FastDiff [26]. Further improvements: DAG [79], Itôn [99]. … statistical parametric speech synthesis (SPSS) was a popular method [115,116,132,133,137] consisting of three stages. As shown in Figure 1 (a), the text input is first converted to linguistic features, then acoustic features, and to the waveform in the last stage.…”

Section: Overview of the Text-to-speech Development (mentioning)

confidence: 99%

“…Model based on Itô SDE. Inspired by ItôWave [125], Itôn [99] proposes an end-to-end model for speech synthesis based on Itô SDE. Apart from the encoder-decoder architecture, Itôn [99] introduces a dual-denoiser structure for the generation of mel-spectrogram and waveform, respectively.…”

Section: End-to-end Framework (mentioning)

confidence: 99%

“…Inspired by ItôWave [125], Itôn [99] proposes an end-to-end model for speech synthesis based on Itô SDE. Apart from the encoder-decoder architecture, Itôn [99] introduces a dual-denoiser structure for the generation of mel-spectrogram and waveform, respectively. Moreover, Itôn [99] adopts a two-stage training strategy that trains the encoder and Mel denoiser in the first stage, and the wave denoiser in the second stage.…”

Section: End-to-end Framework (mentioning)

confidence: 99%

“…Apart from the encoder-decoder architecture, Itôn [99] introduces a dual-denoiser structure for the generation of mel-spectrogram and waveform, respectively. Moreover, Itôn [99] adopts a two-stage training strategy that trains the encoder and Mel denoiser in the first stage, and the wave denoiser in the second stage. UNIVERSE [96] Apart from text-to-speech generation, diffusion models have also been widely used in improving the quality of existing degraded audio.…”

Section: End-to-end Framework (mentioning)

confidence: 99%

Audio Diffusion Model for Speech Synthesis: A Survey on Text To Speech and Speech Enhancement in Generative AI

Zhang, Zhang, Zheng et al. 2023

Preprint

Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement. This work conducts a survey of audio diffusion models, which is complementary to existing surveys that either lack the recent progress of diffusion-based speech synthesis or highlight an overall picture of applying diffusion models in multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion models. As for the text-to-speech task, we divide the methods into three categories based on the stage where the diffusion model is adopted: acoustic model, vocoder, and end-to-end framework. Moreover, we categorize various speech enhancement tasks by whether certain signals are removed from or added to the input speech. Comparisons of experimental results and discussions are also covered in this survey.
