Spectrogram Analysis Via Self-Attention for Realizing Cross-Model Visual-Audio Generation

Tan, Huadong; Wu, Guang; Zhao, Pengcheng; Chen, Yanxiang

doi:10.1109/icassp40776.2020.9052918

Cited by 4 publications

(6 citation statements)

References 12 publications


“…We have designated alongside every evaluation metric with arrows, where up-arrow (↑) indicates that a larger value is better and similarly the down-arrow (↓) suggests that a lower value is better. [30] audios and images are generated using the conditional GAN, while we generate videos and audio, which is more challenging using the joint learning. They also use the class label while training the model.…”

Section: Discussion (mentioning)

confidence: 99%


Collaborative Learning to Generate Audio-Video Jointly

Kurmi, Bajaj, Patro, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


There have been a number of techniques that have demonstrated the generation of multimedia data for one modality at a time using GANs, such as the ability to generate images, videos, and audio. However, so far, the task of multi-modal generation of data, specifically for audio and videos both, has not been sufficiently well-explored. Towards this, we propose a method that demonstrates that we are able to generate naturalistic samples of video and audio data by the joint correlated generation of audio and video modalities. The proposed method uses multiple discriminators to ensure that the audio, video, and the joint output are also indistinguishable from real-world samples. We present a dataset for this task and show that we are able to generate realistic samples. This method is validated using various standard metrics such as Inception Score, Frechet Inception Distance (FID) and through human evaluation.



“…In sound2sight [29], future video frames and motion dynamics are generated by conditioning on audio and a few past frames. In SA-CMGAN [30] self-attention mechanism is applied to cross-modal visual-audio generation.…”

Section: Related Work (mentioning)

confidence: 99%

Collaborative Learning to Generate Audio-Video Jointly

Kurmi, Bajaj, Patro, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


“…It also achieves an end-to-end output with a Video-to-Audio (V-A) model. In [9], an SA-CMGAN model was introduced, which uses two networks to mutually generate images and audio. Each network includes an encoder, generator, and discriminator.…”

Section: Basic Methods and Principles Mainly Involved By Foley (mentioning)

confidence: 99%

An Overview of Visual Sound Synthesis Generation Tasks Based on Deep Learning Networks

Gao 2023

TETR


Visual sound synthesis (which refers to the process of recreating, as realistically as possible, the sound produced by the movements and actions of objects within a video, given specific conditions such as video content and accompanying text) is an important part of the composition of high-quality films at present. Most traditional methods of sound synthesis are based on the artificial creation of simulated props for sound effects synthesis, which is achieved by using various existing props and constructed scenes. However, traditional methods cannot meet specific conditions for sound effect synthesis and require large amounts of participant, material resources and time. It can take nearly ten hours to simulate realistic sound effects in a minute-long video. In this paper, we systematically summarize and consolidate current advances in deep learning in the field of visual sound synthesis, based on existing related papers. We focus on the exploration and development history of deep learning models for the task of visual sound synthesis, and classify detailed research methods and related dataset information based on their development characteristics. By analyzing the technical differences among various model approaches, we can summarize potential research directions in the field, thereby further promoting the rapid development and practical implementation of deep learning models in the video domain.


“…Chen et al [9] focused on the generation of an image from the audio and vice-versa for single-instrument performance videos from the URMP dataset [39] using two Generative Adversarial Nets (GAN) [21] while Hao et al [24] improved the performance of the GAN with cross-modal cycle-consistency [82]. Furthermore, Tan et al [62] incorporated self-attention [68] into the GAN architecture and Su et al [60] proposed to generate a piano sound by vocoding Midi predicted from a video. Recently, Kurmi et al [36] brought a generation of short (1s) musical videos into the picture.…”

Section: Related Work (mentioning)

confidence: 99%

“…Previous works have proposed models to controllably generate e.g. images [13,17,38,45,48,51,55,57,73,76,77], videos [6,12,25,37,42,46,64,65,65,71], and audios [1,9,15,22,24,47,62,63], or separate sounds [18,19,79,80,84]. However, most of the audio works are music-related, and only a few attempts have been made to generate visually guided audio in an open domain setup [11,83].…”

mentioning

confidence: 99%

Taming Visually Guided Sound Generation

Iashin, Rahtu 2021

Preprint

