2020
DOI: 10.1007/978-3-030-58621-8_44

Foley Music: Learning to Generate Music from Videos

Abstract: In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with th…

Cited by 93 publications (46 citation statements)
References 52 publications
“…Composing music from silent videos. Previous works on music composition from silent videos focus on generating the music from video clips containing people playing various musical instruments, such as the violin, piano, and guitar [6] [21] [22]. Much of the generation result, e.g., the instrument type and even the rhythm, can be directly inferred from the movement of human hands, so the music is to some extent determined.…”
Section: Related Work (mentioning)
confidence: 99%
“…A few very recent works have also explored the multimodal generation problem. Gan et al [26] synthesized plausible music for a silent video clip of people playing musical instruments. Another similar work [27] generated music for a given video.…”
Section: Related Work (mentioning)
confidence: 99%
“…Another interesting task is to localize objects that sound [64,4,54,65,67,11], where the goal is to pinpoint audio sources from the visual data. Other interesting works study audio-visual action recognition [35,38,26,58], audio-visual navigation [22,10,9], talking head synthesis [56], spatial audio from video [43,24,62,42], and visual-to-auditory [33,20].…”
Section: Related Work (mentioning)
confidence: 99%