With the development of deep learning and artificial intelligence, audio synthesis plays a pivotal role in machine learning and shows strong applicability in industry. Meanwhile, researchers have dedicated significant effort to multimodal tasks such as audio-visual processing. In this paper, we survey audio synthesis and audio-visual multimodal processing, which helps in understanding current research and future trends. This review focuses on text-to-speech (TTS), music generation, and tasks that combine visual and acoustic information. The corresponding technical methods are comprehensively classified and introduced, and their future development trends are discussed. This survey can provide guidance for researchers interested in areas such as audio synthesis and audio-visual multimodal processing.
Introduction

Audio synthesis, which aims to synthesize various forms of natural and intelligible sound such as speech and music, has a wide range of application scenarios in human society and industry. Initially, researchers took advantage of pure signal processing methods to find convenient representations of audio that can be easily modelled and transformed back into a temporal waveform. For example, the short-time Fourier transform (STFT) is an efficient way to convert audio into the frequency domain, and Griffin-Lim [31] is a pure signal processing algorithm that can decode an STFT sequence into a temporal waveform. Methods similar to Griffin-Lim include WORLD [62], etc. In recent years, with the rapid development of deep learning technology, researchers have begun to build deep neural networks for audio synthesis and other multimodal tasks in order to simplify the pipeline and improve model performance. Numerous neural network models have emerged for tasks such as text-to-speech (TTS) and music generation. Many TTS models have been reported, such as Parallel WaveGAN [103], MelGAN [45], FastSpeech 2/2s [80], EATS [21], and VITS [40]. Likewise, there are many models for music generation, such as Song From PI, MuseGAN [23], and Jukebox [18]. These models bring great convenience to human production and life, and they provide key references for future research.

Vision is a physiological concept. Humans and animals visually perceive the size, brightness, color, etc. of external objects, and thereby obtain information that is essential for survival. Vision is the most important sense for human beings. In recent years, deep learning has been widely explored in various image processing and computer vision tasks such as image dehazing/deraining, object detection, and image segmentation, which contribute to the development of social productivity.
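Returning to the classical signal-processing pipeline mentioned earlier, Griffin-Lim reconstructs a waveform from an STFT magnitude by alternating between the time and frequency domains, iteratively refining a phase estimate while keeping the magnitude fixed. The following is a minimal sketch using SciPy; the function name, window length, and iteration count are illustrative choices, not part of any specific system surveyed here.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=32, nperseg=256):
    """Estimate a waveform from an STFT magnitude spectrogram by
    iterative phase refinement (Griffin-Lim). `magnitude` has shape
    (freq_bins, frames), as returned by scipy.signal.stft."""
    rng = np.random.default_rng(0)
    # Start from a random phase estimate.
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Invert the current complex spectrogram to a waveform ...
        _, x = istft(magnitude * angles, nperseg=nperseg)
        # ... then keep only the phase of that waveform's STFT;
        # the magnitude stays clamped to the target.
        _, _, spec = stft(x, nperseg=nperseg)
        angles = np.exp(1j * np.angle(spec))
    _, x = istft(magnitude * angles, nperseg=nperseg)
    return x

# Usage: round-trip a short 440 Hz tone through its magnitude spectrogram.
t = np.arange(8192) / 8000.0
signal = np.sin(2 * np.pi * 440.0 * t)
_, _, spec = stft(signal, nperseg=256)
reconstructed = griffin_lim(np.abs(spec))
```

Because the phase is discarded and re-estimated, the reconstruction is not sample-exact, but its magnitude spectrogram converges toward the target, which is sufficient for many vocoding pipelines.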
Image dehazing/deraining means that, given an image degraded by haze/rain, algorithms are used to remove the haze/rain and make the image clear. [95,46,100,47,99,54,98,96,48] proposed neural network-based models for image dehazing/deraining respectively. Object detection means finding out all ...