Visual sound synthesis is the task of recreating, as realistically as possible, the sound produced by the movements and actions of objects in a video, given conditions such as the video content and accompanying text. It is an important part of producing high-quality films today. Most traditional sound synthesis methods rely on manually creating sound effects with existing props and constructed scenes. However, these methods cannot meet specific conditions for sound effect synthesis and require large amounts of personnel, material resources, and time: simulating realistic sound effects for a one-minute video can take nearly ten hours. In this paper, we systematically summarize and consolidate current advances in deep learning for visual sound synthesis, based on the existing literature. We focus on the development history of deep learning models for this task, and classify the research methods and related datasets according to their development characteristics. By analyzing the technical differences among the model approaches, we summarize potential research directions in the field, with the aim of further promoting the development and practical implementation of deep learning models in the video domain.