Since the traditional vocal music course cannot meet the learning objectives of current music majors, this paper explores the reform of the vocal music course with the aid of modern multimedia technology. First, the teaching experience enabled by immersive multimedia technology is examined; spatial and temporal features of interactive behavior in the online vocal music course are then extracted with a convolutional neural network and combined with linguistic features through multimodal fusion. Next, a GCN-LSTM network models the spatial structure and long-range temporal dependencies of the fused features, and interactive behaviors are recognized through joint learning of the multimodal features. Finally, after teacher-student features are constructed, the feature extraction performance, recognition performance, and interactivity of the online vocal music course are analyzed. The results show that the accuracy of teacher-student behavioral feature extraction is above 0.7 overall, with “writing” reaching 0.76 and “speaking” and “listening” also extracted reliably. The recognition accuracy for teacher behaviors exceeds 0.73, and in the 100 s to 250 s interval students raise their hands frequently and interact more actively.
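To make the modeling pipeline summarized above concrete, the sketch below shows one way such a system could be assembled: a per-node frame encoder standing in for the CNN feature extractor, a single graph-convolution layer for spatial structure, an LSTM for long-range temporal dependencies, and late fusion with a linguistic feature vector before behavior classification. This is a minimal illustrative sketch, not the paper's implementation; all layer sizes, the placeholder adjacency matrix adj, the node count, and the number of behavior classes are assumptions.

# Minimal GCN-LSTM multimodal behavior classifier sketch (illustrative only).
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """One GCN layer: H' = ReLU(A_hat @ H @ W), with A_hat assumed pre-normalized."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (batch, time, nodes, in_dim); adj: (nodes, nodes)
        return torch.relu(torch.einsum("ij,btjf->btif", adj, self.linear(x)))


class GCNLSTMBehaviorNet(nn.Module):
    def __init__(self, frame_dim=64, gcn_dim=32, lstm_dim=64,
                 text_dim=16, num_classes=6):
        super().__init__()
        # Per-node frame encoder standing in for the CNN feature extractor.
        self.frame_encoder = nn.Linear(frame_dim, gcn_dim)
        self.gcn = GraphConv(gcn_dim, gcn_dim)
        # LSTM consumes graph-pooled features frame by frame.
        self.lstm = nn.LSTM(gcn_dim, lstm_dim, batch_first=True)
        # Late fusion of visual and linguistic features, then classification.
        self.classifier = nn.Sequential(
            nn.Linear(lstm_dim + text_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, frames, adj, text_feat):
        # frames: (batch, time, nodes, frame_dim); text_feat: (batch, text_dim)
        h = torch.relu(self.frame_encoder(frames))
        h = self.gcn(h, adj)                   # spatial structure per frame
        h = h.mean(dim=2)                      # pool over graph nodes
        _, (h_n, _) = self.lstm(h)             # long-range temporal dependencies
        fused = torch.cat([h_n[-1], text_feat], dim=-1)  # multimodal fusion
        return self.classifier(fused)          # behavior logits


# Toy forward pass: 2 clips, 30 frames, 8 graph nodes, 64-dim frame features.
model = GCNLSTMBehaviorNet()
adj = torch.eye(8)                             # placeholder normalized adjacency
frames = torch.randn(2, 30, 8, 64)
text_feat = torch.randn(2, 16)
print(model(frames, adj, text_feat).shape)     # -> torch.Size([2, 6])

In this sketch the graph nodes could correspond to teacher and student body keypoints or seat positions; the design choice of pooling over nodes before the LSTM keeps the temporal model small, at the cost of discarding per-node dynamics.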