Conventional motion tutorials rely mainly on predefined motions and vision-based feedback, which limits their application scenarios and requires professional equipment. In this paper, we propose VoLearn, a cross-modal system that supports user-defined motion learning. The system can import a desired motion from RGB video and animate it in a 3D virtual environment. We built an interface for manipulating the input motion, such as controlling its speed and the amplitude of limb movement in each direction. By exporting the virtual rotation data, a user can employ an everyday device (i.e., a smartphone) as a wearable sensor to train and practice the desired motion with comprehensive auditory feedback, which provides both temporal and amplitude assessment. A user study demonstrated that the system helps reduce amplitude and timing errors in motion learning. The developed motion-learning system thus offers high user accessibility, flexibility, and ubiquity in its application.