2019
DOI: 10.1109/access.2019.2953698

Music-Driven Dance Generation

Abstract: In this paper, a novel model for synthesizing dance movements from a music/audio sequence is proposed, which has a variety of potential applications, e.g., virtual reality. For a given unheard song, in order to generate musically meaningful and natural dance movements, the following criteria should be met: 1) the rhythm of the dance actions should be in harmony with the music beat; 2) the generated dance movements should have notable and natural variations. Specifically, a sequence-to-sequence (Seq2Seq) learning arc…
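As a rough illustration of the sequence-to-sequence idea named in the abstract, the sketch below maps a sequence of audio features to a sequence of skeletal poses with a generic encoder-decoder. All layers, feature dimensions, and the teacher-forcing setup are assumptions for illustration, not the paper's actual architecture.

import torch
import torch.nn as nn

class AudioToDanceSeq2Seq(nn.Module):
    # Generic encoder-decoder: audio-feature sequence in, pose sequence out.
    # Dimensions and layer choices are illustrative assumptions only.
    def __init__(self, audio_dim=28, pose_dim=34, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, audio, prev_poses):
        # Encode the whole clip, then decode poses conditioned on the summary
        # (teacher forcing: ground-truth previous poses drive the decoder).
        _, ctx = self.encoder(audio)              # ctx: (1, B, hidden)
        dec_out, _ = self.decoder(prev_poses, ctx)
        return self.out(dec_out)                  # (B, T, pose_dim)

# Example: 2 clips, 120 frames, 28-D audio features, 34-D poses.
model = AudioToDanceSeq2Seq()
pred = model(torch.randn(2, 120, 28), torch.randn(2, 120, 34))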

Cited by 16 publications (8 citation statements)
References 26 publications

“…Due to the large number of freely accessible dance videos, mostly hosted on platforms such as YouTube, multiple studies have utilized automatic pose estimation methods to extract 2D skeletal poses from such videos and construct training data. The system presented in [19] uses an autoregressive multimodal autoencoder based on two LSTM encoders for the two unimodal inputs (skeletal features and music features), which are fused in the decoder with a self-attention mechanism. Similarly, the authors in [18] proposed a multimodal convolutional autoencoder for synthesizing original dance sequences, conditioned on the mel-spectrogram of an input song.…”
Section: Related Work
Mentioning, confidence: 99%
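A minimal sketch of the two-encoder design described in this excerpt, assuming a PyTorch-style implementation: two LSTM encoders for the skeletal and music streams whose outputs are fused with an attention layer before decoding. Layer sizes, feature dimensions, and the fusion mechanism (cross-attention used here as a stand-in for the self-attention described above) are illustrative assumptions, not the published model of [19].

import torch
import torch.nn as nn

class TwoStreamDanceAutoencoder(nn.Module):
    # Two unimodal LSTM encoders (skeleton, music) fused by attention before
    # the decoder stage; all sizes and fusion details are assumptions.
    def __init__(self, pose_dim=34, music_dim=28, hidden=256, heads=4):
        super().__init__()
        self.pose_enc = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.music_enc = nn.LSTM(music_dim, hidden, batch_first=True)
        self.fusion = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, poses, music):
        h_pose, _ = self.pose_enc(poses)            # (B, T, hidden)
        h_music, _ = self.music_enc(music)          # (B, T, hidden)
        # Fuse the two streams: pose features attend to music features.
        fused, _ = self.fusion(h_pose, h_music, h_music)
        dec, _ = self.decoder(fused)
        return self.out(dec)                        # reconstructed/predicted poses

# Example: 2 clips of 120 frames with 34-D poses and 28-D music features.
pred = TwoStreamDanceAutoencoder()(torch.randn(2, 120, 34), torch.randn(2, 120, 28))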
“…Most of the proposed methods in the literature tackle the challenges of automatic dance motion generation by implementing Recurrent Neural Networks (RNNs) to model the temporal correlation of skeletal poses with musical information and generate novel motion sequences [13], [15], [19]-[26]. Nevertheless, RNNs are computationally cumbersome and inefficient for modelling very long sequences, since error accumulation in the predicted pose sequences restricts synthesis to a limited range of future time-steps [27].…”
Section: Introduction
Mentioning, confidence: 99%
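The error-accumulation issue mentioned in this excerpt arises because an autoregressive generator feeds each predicted pose back in as the next input, so per-step errors compound over long horizons. A hedged sketch of such a rollout follows; the one-step model, dimensions, and music conditioning are assumed for illustration and do not come from any cited paper.

import torch
import torch.nn as nn

class StepRNN(nn.Module):
    # One-step predictor: (previous pose, current music frame) -> next pose.
    def __init__(self, pose_dim=34, music_dim=28, hidden=256):
        super().__init__()
        self.cell = nn.GRUCell(pose_dim + music_dim, hidden)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, prev_pose, music_frame, h):
        h = self.cell(torch.cat([prev_pose, music_frame], dim=-1), h)
        return self.head(h), h

@torch.no_grad()
def rollout(model, seed_pose, music, hidden=256):
    # Each predicted pose is fed back as the next input, so any per-step
    # error compounds over the horizon: the limitation noted above.
    B, T, _ = music.shape
    h = torch.zeros(B, hidden)
    pose, out = seed_pose, []
    for t in range(T):
        pose, h = model(pose, music[:, t], h)
        out.append(pose)
    return torch.stack(out, dim=1)                  # (B, T, pose_dim)

# Example: roll out 200 future frames from a zero seed pose.
poses = rollout(StepRNN(), torch.zeros(2, 34), torch.randn(2, 200, 28))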
“…Much work has been done in the area of automatic choreography generation from music. The work can be categorized into search-based methods [5], [13], [30]-[32], shallow neural networks such as the Factored Conditional Restricted Boltzmann Machine (FCRBM) [10], deep neural networks [7], [9], [11], [33], [34], and other methods including dynamic time warping [35] and hidden Markov models [6], [14].…”
Section: Music to Motion Generation
Mentioning, confidence: 99%
“…Mel-frequency cepstral coefficient (MFCC) features and their derivatives are commonly used to describe the pitch of a song in music-driven animation generation, and have been shown to produce good results (Fukayama and Goto, 2015; Alemi et al., 2017; Lee et al., 2018; Tang et al., 2018; Shlizerman et al., 2018; Qi et al., 2019). A mel is a unit of measure based on how human ears perceive frequency.…”
Section: Pitch
Mentioning, confidence: 99%
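One common way to obtain such features is to compute MFCCs and their first- and second-order derivatives with an audio library such as librosa; the file name, number of coefficients, and hop length below are illustrative choices, not values taken from the cited works.

import librosa
import numpy as np

# Load audio at 44.1 kHz and compute 13 MFCCs with a 441-sample hop, giving
# roughly 100 feature frames per second. "song.wav" and all parameters are
# illustrative.
y, sr = librosa.load("song.wav", sr=44100)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=441)
delta = librosa.feature.delta(mfcc)                 # first derivative
delta2 = librosa.feature.delta(mfcc, order=2)       # second derivative
features = np.vstack([mfcc, delta, delta2])         # shape: (39, n_frames)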
“…Because the audio is recorded at a different rate (a sampling rate of 44.1 kHz, i.e., 44,100 audio samples per second) than the motion capture data (100 frames per second), the MFCC features must be aligned to the motion capture by dividing the audio into windows of size h, which is computed by (Qi et al., 2019):
Section: Pitch
Mentioning, confidence: 99%
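Under the plain assumption that each window simply groups the audio samples falling within one motion-capture frame, the window size is the ratio of the two rates; the sketch below shows only that arithmetic and does not reproduce the exact formula of Qi et al. (2019).

audio_sr = 44100          # audio samples per second (44.1 kHz)
mocap_fps = 100           # motion-capture frames per second

# Assumed alignment: one analysis window of h audio samples per mocap frame.
h = audio_sr // mocap_fps
print(h)                  # 441 audio samples per motion-capture frame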