2020
DOI: 10.1109/access.2020.3040797

High-Fidelity Audio Generation and Representation Learning With Guided Adversarial Autoencoder

Abstract: Generating high-fidelity conditional audio samples and learning representation from unlabelled audio data are two challenging problems in machine learning research. Recent advances in the Generative Adversarial Neural Networks (GAN) architectures show great promise in addressing these challenges. To learn powerful representation using GAN architecture, it requires superior sample generation quality, which requires an enormous amount of labelled data. In this paper, we address this issue by proposing Guided Adv…

Cited by 11 publications (4 citation statements)
References 43 publications
“…(3) Synthesis and Classification Evaluation Metrics: For the speech synthesis task, the Frechet Inception Distance (FID) [38] is selected to evaluate the sample quality that computes the Frechet Distance [39] between two multivariate Gaussian distributions for the synthetic and real samples. We follow a standard FID setup in [15] to evaluate the quality of over 10,000 synthetic speech samples generated from random noise. For the speech command classification task, classification accuracy is used to evaluate the student model.…”
Section: Methods (mentioning)
confidence: 99%
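
The Frechet Distance between two multivariate Gaussians mentioned in the statement above can be illustrated with a minimal sketch. It assumes feature vectors have already been extracted from the real and synthetic samples; the citing paper's feature extractor and its exact FID setup are not given here, so the function name `frechet_distance` and its inputs are hypothetical. The sketch fits a Gaussian to each feature set and evaluates the standard closed form ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2)).

```python
import numpy as np


def frechet_distance(real_feats, fake_feats):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Both inputs are (n_samples, n_features) arrays of pre-extracted features.
    The feature extractor is NOT part of this sketch (assumption).
    """
    real_feats = np.asarray(real_feats, dtype=np.float64)
    fake_feats = np.asarray(fake_feats, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each set of features.
    mu_r = real_feats.mean(axis=0)
    mu_f = fake_feats.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real_feats, rowvar=False))
    cov_f = np.atleast_2d(np.cov(fake_feats, rowvar=False))

    # Tr((C_r C_f)^(1/2)) equals the sum of square roots of the eigenvalues
    # of C_r @ C_f, which are real and non-negative for PSD covariances.
    # Small negative or imaginary parts from round-off are discarded.
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f) - 2.0 * trace_sqrt)
```

As a sanity check on the formula, comparing a feature set with itself gives a distance of zero up to round-off, and shifting every feature vector by a constant leaves the covariance unchanged, so the result reduces to the squared norm of the shift.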
“…PATE-GAN [13] tries to overcome this issue by incorporating a generative block jointly trained with the PATE block; the goal is providing enough synthetic data to train deep models effectively. Unfortunately, PATE-GAN does not work well for high dimensional data synthesis (e.g., images), as demonstrated in recent studies [14,15]. Moreover, generating speech samples is a challenging task, as shown in recent studies about neural vocoders [16,17].…”
Section: Introduction (mentioning)
confidence: 99%
“…In our previous work [51], we first proposed an IoT System of Systems framework of audio generation for visual inputs exploiting BigGAN [14]. Recently, authors in [52] utilized BigGAN architecture for adversarial audio generation in a guided manner. Our proposed FoleyGAN architecture is a novel approach to apply BigGAN in the movie sound production domain where we are synthesizing the audio for silent movie clips using visual and temporal guidance.…”
Section: Audio Generation With GAN (mentioning)
confidence: 99%
“…In our previous work [43] for the first we propose a System of Systems framework of audio generation for visual inputs exploiting BigGAN [14]. Recently, authors in [44] utilizes BigGAN architecture for adversarial audio generation in guided manner. Our proposed FoleyGAN architecture is a noble approach to apply BigGAN in movie sound production domain where we are synthesizing the audio for silent movie clips taking visual guidance.…”
Section: Audio Generation With GAN (mentioning)
confidence: 99%