Review of end-to-end speech synthesis technology based on deep learning

Mu, Zhaoxi; Yang, Xinyu; Dong, Yizhuo

doi:10.48550/arxiv.2104.09995

Cited by 8 publications

(6 citation statements)

References 179 publications

(199 reference statements)


“…GANs are revolutionizing music creation by tapping into existing compositions' patterns and structures [71]. This technology fosters original music composition and assists musicians in their creative journey.…”

Section: Music Generation (mentioning)

confidence: 99%

Ten years of generative adversarial nets (GANs): a survey of the state-of-the-art

Chakraborty, Reddy K S, Naik et al. 2024

Mach. Learn.: Sci. Technol.


Generative Adversarial Networks (GANs) have rapidly emerged as powerful tools for generating realistic and diverse data across various domains, including computer vision and other applied areas, since their inception in 2014. Consisting of a discriminative network and a generative network engaged in a Minimax game, GANs have revolutionized the field of generative modeling. In February 2018, GAN secured the leading spot on the “Top Ten Global Breakthrough Technologies List” issued by the Massachusetts Science and Technology Review. Over the years, numerous advancements have been proposed, leading to a rich array of GAN variants, such as conditional GAN, Wasserstein GAN, CycleGAN, and StyleGAN, among many others. This survey aims to provide a general overview of GANs, summarizing the latent architecture, validation metrics, and application areas of the most widely recognized variants. We also delve into recent theoretical developments, exploring the profound connection between the adversarial principle underlying GAN and Jensen-Shannon divergence while discussing the optimality characteristics of the GAN framework. The efficiency of GAN variants and their model architectures will be evaluated along with training obstacles as well as training solutions. In addition, a detailed discussion will be provided, examining the integration of GANs with newly developed deep learning frameworks such as Transformers, Physics-Informed Neural Networks, Large Language models, and Diffusion models. Finally, we reveal several issues as well as future research outlines in this field.


“…Harshvardhan et al [76] instead covered deep generation as part of generation in machine learning and proposed future directions. In addition to surveys on general deep data generation, other surveys may focus on the deep data generation in specific domains including graph generation [77][78][79], image synthesis [80,81], text generation [82,83] and audio generation [84][85][86].…”

Section: Relationship With Existing Surveys (mentioning)

confidence: 99%

Controllable Data Generation by Deep Learning: A Review

Wang, Du, Guo et al. 2022

Preprint


Designing and generating new data under targeted properties has been attracting various critical applications such as molecule design, image editing and speech synthesis. Traditional hand-crafted approaches heavily rely on expertise experience and intensive human efforts, yet still suffer from the insufficiency of scientific knowledge and low throughput to support effective and efficient data generation. Recently, the advancement of deep learning induces expressive methods that can learn the underlying representation and properties of data. Such capability provides new opportunities in figuring out the mutual relationship between the structural patterns and functional properties of the data and leveraging such relationship to generate structural data given the desired properties. This article provides a systematic review of this promising research area, commonly known as controllable deep data generation. Firstly, the potential challenges are raised and preliminaries are provided. Then the controllable deep data generation is formally defined, a taxonomy on various techniques is proposed and the evaluation metrics in this specific domain are summarized. After that, exciting applications of controllable deep data generation are introduced and existing works are experimentally analyzed and compared. Finally, the promising future directions of controllable deep data generation are highlighted and five potential challenges are identified.


“…The widespread use of deep learning has significantly advanced the development of speech synthesis technology [1]. This innovation not only enables artificial intelligence technology to expand its application scope to encompass more audio synthesis scenarios and enhance natural language interaction experience through a more authentic and credible audio output, but also gives rise to numerous acclaimed applications on public platforms.…”

Section: Introduction (mentioning)

confidence: 99%

Residual-based feature enhancement for forgery audio detection

Zheng, Ling, Hai 2023

Third International Conference on Advanced Algorithms and Signal Image Processing (AASIP 2023)


In recent years, speech synthesis technology has become increasingly advanced, leading to a proliferation of forged audio content on the internet, which poses significant threat to individuals and society. Many studies have utilized a range of deep learning-based techniques to differentiate fake audio content, but the features used in these studies are often limited in their rich and generalizable characteristics. In this paper, we propose a novel fake voice detection technology that utilizes the wav2vec2 model for feature extraction along with a custom-designed residual-based detection module to augment the detection of fake audio content with greater accuracy and precision. Additionally, we incorporate a data augmentation method to improve the performance of the model and enhance its ability to generalize. We trained our model on the ASVspoof2019 dataset and evaluated it on the LA and DF datasets of the ASVspoof2021 dataset. Supplementary experiments demonstrated that our approach achieved state-of-the-art detection performance and illustrated its effectiveness and applicability.
