Upsampling artifacts in neural audio synthesis

Pons, Jordi; Pascual, Santiago; Cengarle, Giulio; Serrà, Joan

doi:10.48550/arxiv.2010.14356

Cited by 2 publications

(5 citation statements)

References 11 publications

“…A transposed convolution operation forms the same connectivity as a direct convolution but in the backward direction, which requires upsampling the input into an output of larger dimensions. Transposed convolutions are commonly used in CNN training and in emerging CNN workloads [40,41,46,47,49,64,65,67-72]. Figure 1…”

Section: Transposed Convolution (mentioning)

confidence: 99%
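
The statement above describes a transposed convolution as an upsampling operation. A minimal single-channel sketch of that idea follows, assuming NumPy and no padding; the function names `transposed_conv1d` and `transposed_conv1d_zero_insert` are hypothetical and are not taken from any of the cited papers. It shows that scatter-adding scaled copies of the kernel gives the same result as inserting zeros between input samples and running a direct convolution.

```python
import numpy as np

def transposed_conv1d(x, w, stride=2):
    """Transposed 1D convolution (no padding) via scatter-add.

    Each input sample adds a scaled copy of the kernel to the output,
    and consecutive copies are offset by `stride` samples.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    y = np.zeros((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        y[i * stride : i * stride + len(w)] += xi * w
    return y

def transposed_conv1d_zero_insert(x, w, stride=2):
    """Same result: insert (stride - 1) zeros between samples, then convolve."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    up = np.zeros((len(x) - 1) * stride + 1)
    up[::stride] = x  # every other position stays zero when stride = 2
    return np.convolve(up, w, mode="full")

# Constant input, length-3 kernel of ones, stride 2.
y = transposed_conv1d(np.ones(4), np.ones(3), stride=2)
print(y)
```

With a constant input, the output alternates between 1 and 2 in the interior, because a kernel of length 3 overlaps unevenly at stride 2. That periodic pattern is one simple way to see why transposed convolutions can leave upsampling artifacts, and the zeros inserted in the second variant are the kind of zero padding the EcoFlow abstract refers to.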

“…Dilated convolutions are commonly used in CNN training and in emerging CNN workloads [40,41,46,47,49,64,65,73-77]. Figure 1 3 shows a dilated convolution example that calculates the filter gradients (δW_xy) with dilation rate = 2 (i.e., stride 2) in the backward propagation pass of CNN training.…”

Section: Dilated Convolution (mentioning)

confidence: 99%
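
For readers unfamiliar with the term, a dilated convolution with dilation rate 2 reads every second input sample under the kernel. The sketch below is a single-channel illustration assuming NumPy and valid-mode cross-correlation; the names `dilated_conv1d` and `dilated_conv1d_zero_insert` are hypothetical, and the code is not the gradient computation from the cited figure.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=2):
    """Valid-mode dilated cross-correlation.

    y[n] = sum_k w[k] * x[n + dilation * k]
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    span = (len(w) - 1) * dilation + 1  # input samples covered by the kernel
    n_out = len(x) - span + 1
    return np.array(
        [np.dot(w, x[n : n + span : dilation]) for n in range(n_out)]
    )

def dilated_conv1d_zero_insert(x, w, dilation=2):
    """Same result: put (dilation - 1) zeros between kernel taps, then correlate."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w_dil = np.zeros((len(w) - 1) * dilation + 1)
    w_dil[::dilation] = w
    return np.correlate(x, w_dil, mode="valid")

y = dilated_conv1d(np.arange(6), [1, 2, 3], dilation=2)
print(y)  # [16. 22.]
```

The second function makes the cost visible: the zero-filled kernel has 5 taps, of which only 3 carry weights, so a naive mapping spends part of its multiplications on zeros.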

“…For example, both kernels are employed in applications requiring significant upsampling or downsampling to process high-resolution media such as image generation (using Generative Adversarial Networks (GANs) and Variational Auto-encoders (VAEs) [7,39]), image super-resolution [40][41][42], and image segmentation [43,44]. Additionally, more emerging machine learning works in text-to-speech generation [45], speech recognition [46], and audio synthesis [47] use dilated convolutions. Other experimental machine learning models, such as hierarchical capsule networks [48] and dilated residual networks [49] for improved image modeling, use both these convolution types.…”

Section: Introduction (mentioning)

confidence: 99%

“…While these works demonstrate efficient execution of direct convolutions (i.e., regular or 'standard' convolutions), we find that existing dataflows for transposed and dilated convolutions are poorly tailored for these architectures, causing significant bottlenecks for emerging edge workloads that use transposed and dilated convolutions. Despite this issue, these workloads are of growing interest to manufacturers, because they can enable: (1) on-device model training for improved user data privacy [59-61], (2) high-resolution image generation critical for augmented reality [62,63], (3) real-time speech recognition and generation [42,45], and many other applications employing dilated and transposed convolutions [40,41,46,47,49,64-77].…”

Section: Introduction (mentioning)

confidence: 99%

EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators

Orosa, Koppula, Umuroglu et al. 2022. Preprint.

Dilated and transposed convolutions are widely used in modern convolutional neural networks (CNNs). These kernels are used extensively during CNN training and inference of applications such as image segmentation and high-resolution image generation. Although these kernels have grown in popularity, they stress current compute systems due to their high memory intensity, exascale compute demands, and large energy consumption. We find that commonly used low-power CNN inference accelerators based on spatial architectures are not optimized for either of these convolutional kernels. Dilated and transposed convolutions introduce significant zero padding when mapped to the underlying spatial architecture, significantly degrading performance and energy efficiency. Existing approaches that address this issue require significant design changes to the otherwise simple, efficient, and well-adopted architectures used to compute direct convolutions. To address this challenge, we propose EcoFlow, a new set of dataflows and mapping algorithms for dilated and transposed convolutions. These algorithms are tailored to execute efficiently on existing low-cost, small-scale spatial architectures and require minimal changes to the network-on-chip of existing accelerators. At its core, EcoFlow eliminates zero padding through careful dataflow orchestration and data mapping tailored to the spatial architecture. EcoFlow enables flexible and high-performance transposed and dilated convolutions on architectures that are otherwise optimized for CNN inference. We evaluate the efficiency of our dataflows on CNN training workloads and Generative Adversarial Network (GAN) training workloads. Experiments in our new cycle-accurate spatial architecture simulator show that EcoFlow 1) reduces end-to-end CNN training time by 7-85%, and 2) improves end-to-end GAN training performance by 29-42%, compared to state-of-the-art CNN inference accelerators.
[Open-Source Artifact] We open-source both our Spatial Architecture Simulator for Machine Learning (SASiML) and the SASiML compiler to help enable the development of new dataflows and high-accuracy simulation environments for new spatial architectures and dataflows. Both are freely available at https://github.com/CMU-SAFARI/sasiml.

“…Mel spectrogram upsampling is done by alternating nearest-neighbor upsampling and a 1D convolution layer with a kernel size of 3. We use nearest-neighbor upsampling over transposed convolution as [24,25] report various advantages (e.g., less distortion in the frequency domain, fewer checkerboard artifacts, and better preservation of information from low resolution).…”

Section: Proposed Architecture (mentioning)

confidence: 99%
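
The upsampling scheme quoted above (nearest-neighbor upsampling followed by a 1D convolution with kernel size 3) can be sketched for a single channel as follows. This assumes NumPy, zero padding at the edges, and an averaging kernel chosen only for illustration; the helper names are hypothetical and this is not the cited authors' implementation, which operates on multi-channel mel spectrograms with learned weights.

```python
import numpy as np

def nearest_neighbor_upsample(x, factor=2):
    """Repeat every sample `factor` times along the time axis."""
    return np.repeat(np.asarray(x, dtype=float), factor)

def conv1d_same(x, w):
    """1D convolution that keeps the input length (zero-padded edges)."""
    return np.convolve(x, np.asarray(w, dtype=float), mode="same")

# Hypothetical kernel of size 3; a trained layer would learn these weights.
w = np.ones(3) / 3

x = np.ones(4)                                   # constant low-rate input
y = conv1d_same(nearest_neighbor_upsample(x, 2), w)
print(y)
```

For this constant input the interior of the output stays constant, since every output sample sees the same number of nonzero inputs under the kernel. Only the two edge samples differ, and that comes from the zero padding rather than from the upsampling itself.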

GAN Vocoder: Multi-Resolution Discriminator Is All You Need

You, Kim, Nam et al. 2021. Interspeech 2021.

Several of the latest GAN-based vocoders show remarkable achievements, outperforming autoregressive and flow-based competitors in both qualitative and quantitative measures while synthesizing orders of magnitude faster. In this work, we hypothesize that the common factor underlying their success is the multi-resolution discriminating framework, not the minute details in architecture, loss function, or training strategy. We experimentally test the hypothesis by evaluating six different generators paired with one shared multi-resolution discriminating framework. For all evaluative measures with respect to text-to-speech syntheses and for all perceptual metrics, their performances are not distinguishable from one another, which supports our hypothesis.
