Online professional-creative fusion education for music majors is becoming increasingly prevalent, but accurately identifying students' classroom states remains a challenge. This study proposes a fusion approach that combines depthwise separable convolution with a convolutional neural network (CNN) to recognize the classroom states of music majors in online classes. First, students' facial expressions during class are collected as sensor data. A CNN enhanced with depthwise separable convolution then processes these feature data and classifies them. In parallel, students' in-class behavioral data and assessment information are fused as multimodal data, yielding an integrated estimate of students' classroom states. Experimental validation shows that the proposed fusion method performs well in recognizing students' classroom states, with an average F1 score of 0.96, a recall of 0.92, a recognition accuracy of 94.12%, and a recognition time of 2.10 seconds. The method accurately distinguishes whether students are focused, distracted, or not in a class state, giving music educators an effective tool for understanding students' learning states and supporting personalized teaching management and guidance.
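To illustrate why a separable convolution is a plausible enhancement for a CNN classifier, the following minimal sketch compares the weight count of a standard convolution layer with that of its depthwise separable counterpart (a per-channel spatial filter followed by a 1 x 1 pointwise convolution). The function names and the layer sizes (64 input channels, 128 output channels, 3 x 3 kernel) are illustrative assumptions and are not taken from the model described here; biases are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    c_in, c_out, k = 64, 128, 3
    std = standard_conv_params(c_in, c_out, k)
    sep = depthwise_separable_params(c_in, c_out, k)
    print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

Under these assumed sizes the separable layer needs 8,768 weights against 73,728 for the standard layer, roughly 12% of the original. In general the ratio is 1/c_out + 1/k², so the saving grows with the number of output channels and the kernel size.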