Figure 1: Given speech audio and an arbitrary motion-related text prompt, our method generates synergistic full-body motion that matches both the speech content and the prompt, even when the prompted motion does not appear in the speech-to-motion dataset used for training, as in the "walking in a clockwise circle" example shown in the figure. The generated motion also remains highly consistent with the script content and the audio rhythm of the input speech.