Foley Music: Learning to Generate Music from Videos

Gan, Chuang; Huang, Deng; Chen, Peihao; Tenenbaum, Joshua B.; Torralba, Antonio

doi:10.1007/978-3-030-58621-8_44

Cited by 93 publications

(46 citation statements)

References 52 publications

“…Composing music from silent videos. Previous works on music composition from silent videos focus on generating the music from video clips containing people playing various musical instruments, such as the violin, piano, and guitar [6] [21] [22]. Much of the generation result, e.g., the instrument type and even the rhythm, can be directly inferred from the movement of human hands, so the music is to some extent determined.…”

Section: Related Work (mentioning)

confidence: 99%

Video Background Music Generation with Controllable Music Transformer

Jiang, Liu, et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia

In this work, we address the task of video background music generation. Some previous works achieve effective music generation but are unable to generate melodious music tailored to a particular video, and none of them considers the video-music rhythmic consistency. To generate the background music that matches the given video, we first establish the rhythmic relations between video and background music. In particular, we connect timing, motion speed, and motion saliency from video with beat, simu-note density, and simu-note strength from music, respectively. We then propose CMT, a Controllable Music Transformer that enables local control of the aforementioned rhythmic features and global control of the music genre and instruments. Objective and subjective evaluations show that the generated background music has achieved satisfactory compatibility with the input videos, and at the same time, impressive music quality. Code and models are available at https://github.com/wzk1015/video-bgm-generation. CCS Concepts: Applied computing → Sound and music computing.

“…A few very recent works have also explored the multimodal generation problem. Gan et al [26] synthesized plausible music for a silent video clip of people playing musical instruments. Another similar work [27] generated music for a given video.…”

Section: Related Work (mentioning)

confidence: 99%

Collaborative Learning to Generate Audio-Video Jointly

Kurmi, Bajaj, Patro, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

There have been a number of techniques that have demonstrated the generation of multimedia data for one modality at a time using GANs, such as the ability to generate images, videos, and audio. However, so far, the task of multi-modal generation of data, specifically for audio and videos both, has not been sufficiently well-explored. Towards this, we propose a method that demonstrates that we are able to generate naturalistic samples of video and audio data by the joint correlated generation of audio and video modalities. The proposed method uses multiple discriminators to ensure that the audio, video, and the joint output are also indistinguishable from real-world samples. We present a dataset for this task and show that we are able to generate realistic samples. This method is validated using various standard metrics such as Inception Score, Frechet Inception Distance (FID) and through human evaluation.

“…Another interesting task is to localize objects that sound [64,4,54,65,67,11], where the goal is to pinpoint audio sources from the visual data. Other interesting works study audio-visual action recognition [35,38,26,58], audio-visual navigation [22,10,9], talking head synthesis [56], spatial audio from video [43,24,62,42], and visual-to-auditory [33,20].…”

Section: Related Work (mentioning)

confidence: 99%

V-SlowFast Network for Efficient Visual Sound Separation

Zhu, Rahtu 2021

Preprint

The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new light yet efficient three-stream framework V-SlowFast that operates on Visual frame, Slow spectrogram, and Fast spectrogram. The Slow spectrogram captures the coarse temporal resolution while the Fast spectrogram contains the fine-grained temporal resolution; iii) we introduce two contrastive objectives to encourage the network to learn discriminative visual features for separating sounds; iv) we propose an audio-visual global attention module for audio and visual feature fusion; v) the introduced V-SlowFast model outperforms previous state-of-the-art in single-frame based visual sound separation on small- and large-scale datasets: MUSIC-21, AVE, and VGG-Sound. We also propose a small V-SlowFast architecture variant, which achieves 74.2% reduction in the number of model parameters and 81.4% reduction in GMACs compared to the previous multi-stage models. Project page: https://lyzhu.github.io/V-SlowFast.
