Video Background Music Generation with Controllable Music Transformer

Di, Shangzhe; Jiang, Zeren; Liu, Si; Wang, Zhaokai; Zhu, Leyan; He, Zhongshi; Li, Hongming; Yan, Shuicheng

doi:10.1145/3474085.3475195

Cited by 52 publications

(28 citation statements)

References 13 publications


“…Dance2Music [1]: Similar to [16], the generated music with this method is also monotonic in terms of the musical instrument. Controllable Music Transformer (CMT) [10]: CMT is a Transformer-based model proposed for video background music generation using MIDI representation. In addition to the above cross-modality models that are closely related to our work, we also consider Ground Truth: GT samples are the original music from dance videos.…”

Section: Methods (mentioning)

confidence: 99%

“…Gan et al [16] propose a graph-based transformer framework to generate music from performance videos using raw movement as input. Di et al [10] propose to generate video background music conditioned on the motion and special timing/rhythmic features of the input videos. In contrast to these previous works, our work combines three modalities, which takes the vision and motion data as input and generates music accordingly.…”

Section: Audio Vision and Motion (mentioning)

confidence: 99%


Quantized GAN for Complex Music Generation from Dance Videos

Zhu, Olszewski, Wu, et al. 2022

Preprint


We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates complex musical samples conditioned on dance videos. Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input. Unlike most existing conditional music generation works that generate specific types of mono-instrumental sounds using symbolic audio representations (e.g., MIDI), and that heavily rely on pre-defined musical synthesizers, in this work we generate dance music in complex styles (e.g., pop, breakdancing, etc.) by employing a Vector Quantized (VQ) audio representation, and leverage both its generality and the high abstraction capacity of its symbolic and continuous counterparts. By performing an extensive set of experiments on multiple datasets, and following a comprehensive evaluation protocol, we assess the generative quality of our approach against several alternatives. The quantitative results, which measure the music consistency, beats correspondence, and music diversity, clearly demonstrate the effectiveness of our proposed method. Last but not least, we curate a challenging dance-music dataset of in-the-wild TikTok videos, which we use to further demonstrate the efficacy of our approach in real-world applications, and which we hope will serve as a starting point for relevant future research. The code is available at https://github.com/L-YeZhu/D2M-GAN.

“…Fine-grained control has been a topic of interest in the recent literature (Choi et al, 2020; Hadjeres & Crestel, 2020; Wu & Yang, 2021; Di et al, 2021; Ferreira & Whitehead, 2021) and is an essential property when considering user-directed applications. In essence, fine-grained control is necessary to allow control over salient features in the generation, as saliency in music at least partly lies in how it changes over time.…”

Section: Related Work (mentioning)

confidence: 99%

FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control

Dimitri, Biggio, Kilcher, et al. 2022

Preprint


Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods are only able to exert minimal control over the generated sequence, if any. We propose the self-supervised description-to-sequence task, which allows for fine-grained controllable generation on a global level. We do so by extracting high-level features about the target sequence and learning the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying description-to-sequence modelling to symbolic music. By combining learned high-level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.


“…Wang et al proposed PianoTree VAE [31], which uses GRU to encode notes played at the same time and map them to a latent space to achieve controllable generation of polyphonic music based on a tree structure. Di et al achieved rhythmic consistency between video and background music and proposed Controllable Music Transformer [12] to locally control the rhythm while globally controlling the music genre and instruments.…”

Section: Controllable Music Generation (mentioning)

confidence: 99%

Melody Harmonization with Controllable Harmonic Rhythm

Wu, Yang, Wang, et al. 2021

Preprint


Melody harmonization, namely generating a chord progression for a user-given melody, remains a challenging task to this day. Although previous neural network-based systems can effectively generate an appropriate chord progression for a melody, few studies focus on controllable melody harmonization, and none of them can generate flexible harmonic rhythms. To achieve harmonic rhythm-controllable melody harmonization, we propose AutoHarmonizer, a neural network-based melody harmonization system that can generate denser or sparser chord progressions with the use of a new sampling method for controllable generation proposed in this paper. This system mainly consists of two parts: a harmonic rhythm model provides coarse-grained chord onset information, while a chord model generates specific pitches for chords based on the given melody and the corresponding harmonic rhythm sequence previously generated. To evaluate the performance of AutoHarmonizer, we use nine metrics to compare the chord progressions from humans, the system proposed in this paper and the baseline. Experimental results show that AutoHarmonizer not only generates harmonic rhythms comparable to the human level, but also generates chords with overall better quality than the baseline at different settings. In addition, we use AutoHarmonizer to harmonize the Session Dataset (which was originally chordless), and end up with 40,925 traditional Irish folk songs with harmonies, named the Session Lead Sheet Dataset, which is the largest lead sheet dataset to date.
