2021
DOI: 10.48550/arxiv.2108.00443
Preprint

A Survey on Audio Synthesis and Audio-Visual Multimodal Processing

Zhaofeng Shi

Abstract: With the development of deep learning and artificial intelligence, audio synthesis plays a pivotal role in machine learning and shows strong applicability in industry. Meanwhile, researchers have dedicated significant effort to multimodal tasks such as audio-visual multimodal processing. In this paper, we conduct a survey on audio synthesis and audio-visual multimodal processing, which helps to understand current research and future trends. This review focuses on text to speech…

Cited by 2 publications (2 citation statements)
References 81 publications
“…A prototypical text-to-speech system (Taylor 2009; Shi 2021; Tan et al. 2021) consists of two basic parts. First, the text to be spoken is specified and (typically) converted into a phonetic and prosodic representation that captures the specific sounds, intonation, stress, and rhythm to be spoken.…”
Section: Audio
confidence: 99%
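As an illustration of the two-stage structure this statement describes, the following is a minimal Python sketch. All names and the toy grapheme-to-phoneme table are hypothetical, not the pipeline of any cited system: stage 1 maps text to a phonetic/prosodic representation, and stage 2 is a placeholder for the acoustic model or vocoder that would render audio.

```python
# Minimal sketch of a two-stage text-to-speech pipeline (hypothetical names).
# Stage 1: text -> phonetic/prosodic representation.
# Stage 2: placeholder for an acoustic model / vocoder.

from dataclasses import dataclass
from typing import List

# Toy grapheme-to-phoneme table -- a real front end would use a full lexicon
# plus prosody prediction (stress, intonation, duration).
G2P = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

@dataclass
class PhoneticUnit:
    phoneme: str
    stressed: bool      # crude stand-in for stress/prosody markup
    duration_ms: float  # predicted duration, fixed here for illustration

def text_to_phonetic(text: str) -> List[PhoneticUnit]:
    """Stage 1: text analysis -> phonetic and prosodic representation."""
    units = []
    for word in text.lower().split():
        for i, ph in enumerate(G2P.get(word, ["SPN"])):  # SPN = unknown word
            units.append(PhoneticUnit(ph, stressed=(i == 0), duration_ms=80.0))
    return units

def phonetic_to_audio(units: List[PhoneticUnit]) -> List[float]:
    """Stage 2 placeholder: an acoustic model / vocoder would go here."""
    # Return silence of the right length just to show the interface.
    n_samples = int(sum(u.duration_ms for u in units) / 1000.0 * 16000)
    return [0.0] * n_samples

if __name__ == "__main__":
    rep = text_to_phonetic("hello world")
    audio = phonetic_to_audio(rep)
    print(len(rep), "phonetic units ->", len(audio), "samples at 16 kHz")
```

Running it only prints the number of phonetic units and the corresponding number of (silent) samples; the point is how the two stages connect, not the audio itself.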
“…Due to the complexity of this task, alternative models such as normalizing flows (NFs) [Prenger et al., 2019] or diffusion models (DFs) [Kong et al., 2020] have also been successfully used, both also involving latent representations z. For more comprehensive surveys of these research domains, see Briot et al. [2017] and Shi [2021].…”
Section: Generative Models
confidence: 99%
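To make the flow-based idea in this statement concrete, here is a minimal NumPy sketch with random stand-in parameters (not Prenger et al.'s WaveGlow or any other trained model): a latent z is drawn from a standard normal prior and mapped through the inverse of a stack of affine coupling layers, which is how flow-based vocoders turn z into a signal at synthesis time.

```python
# Illustrative sketch: sample z ~ N(0, I) and map it through inverted affine
# coupling layers.  Parameters are random stand-ins for learned coupling
# networks; a real vocoder would also condition on a mel-spectrogram.

import numpy as np

rng = np.random.default_rng(0)
DIM = 8        # toy "frame" size
N_LAYERS = 4   # number of coupling layers

# Per-layer (log-scale, shift) acting on the second half of each frame.
layers = [(rng.normal(scale=0.1, size=DIM // 2),
           rng.normal(scale=0.1, size=DIM // 2))
          for _ in range(N_LAYERS)]

def coupling_inverse(y, log_s, t):
    """Invert one affine coupling layer.

    Forward direction was: y_a = x_a, y_b = x_b * exp(log_s) + t.
    """
    y_a, y_b = y[: DIM // 2], y[DIM // 2 :]
    x_b = (y_b - t) * np.exp(-log_s)
    # A real flow would also permute/mix the halves between layers.
    return np.concatenate([y_a, x_b])

def flow_inverse(z):
    """Map a latent z back to signal space through the inverted flow."""
    x = z
    for log_s, t in reversed(layers):
        x = coupling_inverse(x, log_s, t)
    return x

if __name__ == "__main__":
    z = rng.standard_normal(DIM)   # z ~ N(0, I), as in the quoted statement
    x = flow_inverse(z)
    print("latent z:", np.round(z, 2))
    print("sample x:", np.round(x, 2))
```

A trained model would predict log_s and t from the untouched half of the frame and from conditioning features, and would interleave permutations between layers; the sketch only preserves the invertible coupling structure that lets z be mapped to audio and back.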