FloWaveNet : A Generative Flow for Raw Audio

Kim, Sung-Won; Lee, Sang-gil; Song, Jongyoon; Kim, Jaehyeon; Yoon, Sungroh

doi:10.48550/arxiv.1811.02155

Cited by 30 publications

(48 citation statements)

References 0 publications


“…Although non-linear quantization processes such as µ-law received much attention the last years, the majority of the existing papers use a normalized high resolution signal as input [14]. Finally, other applications include linear quantization of the input waveform [15] [16] and different designs for most and less significant bits [17].…”

Section: A Waveform - Raw Audio (mentioning)

confidence: 99%

“…At last, other variations of conditioning have been introduced as well. Kim et al [14] adjusted conditioning through the loss function. They estimated an auxiliary probability density using mel-spectrograms for local conditioning.…”

Section: Other (mentioning)

confidence: 99%

“…The implementation has been proposed by NVIDIA and it is able to generate sound in real time. Insightful alternatives have also been proposed on normalising flows by using only a single loss function, without any auxiliary loss terms [14] or by applying dilated 2-D convolutional layers [64].…”

Section: B Normalizing Flow (mentioning)

confidence: 99%

“…A final evaluation metric includes a Negative Log Likelihood (NLL) [17] [15] and an objective Conditional Log Likelihood (CLL) [14] usually measured in bits per audio sample.…”

Section: F Log Likelihood (mentioning)

confidence: 99%


Audio representations for deep learning in sound synthesis: A review

Anastasia¹,

O’Leary²

2022

Preprint


The rise of deep learning algorithms has led many researchers to withdraw from using classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation. The choice of architecture is tightly coupled to the audio representations. A sound's original waveform can be too dense and rich for deep learning models to deal with efficiently, and its complexity increases training time and computational cost. Also, it does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio has been transformed into a compressed and more meaningful form using upsampling, feature extraction, or even by adopting a higher-level illustration of the waveform. Furthermore, depending on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.


“…The development of neural end-to-end text-to-speech (TTS) models [1,2,3,4,5,6] has greatly promoted speech synthesis. Generally, with a well-trained neural acoustic model [2,5,6,7] and a neural vocoder [8,9,10,11], or alternatively using fully end-to-end models [12,13,14] which directly construct wave signals from text input, it is able to synthesize high-quality neutral speech. Recently, much attention has been attracted to synthesizing expressive speech, such as stylized speech [15,16], emotional speech [17,18,19,20,21,22], and also singing voice [23,24].…”

Section: Introduction (mentioning)

confidence: 99%

Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis

Wang¹,

Wang²,

Zhu³

et al. 2022

Preprint


This paper introduces Opencpop, a publicly available high-quality Mandarin singing corpus designed for singing voice synthesis (SVS). The corpus consists of 100 popular Mandarin songs performed by a female professional singer. Audio files are recorded with studio quality at a sampling rate of 44,100 Hz and the corresponding lyrics and musical scores are provided. All singing recordings have been phonetically annotated with phoneme boundaries and syllable (note) boundaries. To demonstrate the reliability of the released data and to provide a baseline for future research, we built baseline deep neural network-based SVS models and evaluated them with both objective metrics and subjective mean opinion score (MOS) measure. Experimental results show that the best SVS model trained on our database achieves 3.70 MOS, indicating the reliability of the provided corpus. Opencpop is released to the open-source community WeNet, and the corpus, as well as synthesized demos, can be found on the project homepage.


PeriodNet: A Non-Autoregressive Waveform Generation Model with a Structure Separating Periodic and Aperiodic Components

Hono

Takaki

Hashimoto

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


We propose PeriodNet, a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling periodic and aperiodic components in speech waveforms. Non-AR waveform generation models can generate speech waveforms in parallel and can be used as a speech vocoder by conditioning on an acoustic feature. Since a speech waveform contains periodic and aperiodic components, both components should be appropriately modeled to generate a high-quality speech waveform. However, it is difficult to decompose the components from a natural speech waveform in advance. To address this issue, we propose a parallel model and a series model structure separating periodic and aperiodic components. The features of our proposed models are that explicit periodic and aperiodic signals are taken as input, and external periodic/aperiodic decomposition is not needed in training. Experiments using a singing voice corpus show that our proposed structure improves the naturalness of the generated waveform. We also show that speech waveforms with a pitch outside of the training data range can be generated with more naturalness.
