“…Denoising diffusion probabilistic models (diffusion models for short) have achieved the state-of-theart (SOTA) generation results in various tasks, including image [34,22,8,7,33,39,44] and super resolution image generation [13,31,41,25], text-to-image generation [23,11,14,28], text-to-speech synthesis [4,15,27,17,16,5] and speech enhancement [20,21,42]. Especially, in audio synthesis, diffusion models have shown strong ability in modelling both spectrogram features [27,17] and raw waveforms [4,15,5].…”