Automatically synthesizing dance motion sequences is an increasingly popular research task in the broader field of human motion analysis. Recent approaches have mostly used recurrent neural networks (RNNs), which are known to suffer from prediction error accumulation, usually limiting models to synthesizing short choreographies of fewer than 100 poses. In this paper, we present a multimodal convolutional autoencoder that combines 2D skeletal and audio information through an attention-based feature fusion mechanism and can generate novel dance motion sequences of arbitrary length. We first validate the ability of our system to capture the temporal context of dancing in a unimodal setting, considering only skeletal features as input. According to 1440 ratings provided by 24 participants in our initial user study, the best performance was achieved by the model trained with input sequences of 500 poses. Based on this outcome, we train the proposed multimodal architecture with two different approaches, namely teacher forcing and self-supervised curriculum learning, to mitigate autoregressive error accumulation. In our evaluation campaign, we generate 1800 sequences and compare our method against two state-of-the-art approaches. Through qualitative and quantitative experiments, we demonstrate the improvements introduced by the proposed multimodal architecture in terms of realism, motion diversity and multimodality, reducing the Fréchet Inception Distance (FID) by 0.39. Subjective results confirm the effectiveness of our approach in synthesizing diverse dance motion sequences, with a 6% increase in style consistency preference according to 1800 answers provided by 45 evaluators.