A music performance system controls stage lighting by identifying the emotional content of the music, so a recognition error prevents it from producing a good stage effect. To address this, a multimodal music emotion recognition method based on image sequences is studied. The emotional characteristics of music, including acoustic, melodic, and audio characteristics, are analyzed and used to construct a feature vector. A neural-network-based recognition and classification model is trained by adjusting the weights and thresholds of each layer, and the feature vector is then input into the trained model to achieve intelligent recognition and classification of multimodal music emotion. The center clipping method gives the threshold for the onset range of a specific hummed note and is used to eliminate the low-amplitude part of the humming signal; the short-time spectral structure features and envelope features of the pitch are then extracted to complete the multimodal music emotion recognition. The results show that the calculated kappa coefficient k is greater than 0.75, indicating that the recognition and classification results agree well with the actual results and that the classification and recognition accuracy is high.
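Two of the steps named above, center clipping and the kappa coefficient, can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The function names `center_clip` and `cohen_kappa`, the clipping ratio of 0.3, and the emotion labels in the usage note are assumptions chosen for illustration. The clipping threshold is taken as a fraction of the peak amplitude, which is one common convention, and the kappa shown is the standard Cohen's kappa, k = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from collections import Counter


def center_clip(signal, ratio=0.3):
    """Center-clip a signal: zero out low-amplitude samples.

    The threshold is ratio * peak amplitude (the ratio is an assumed value).
    Samples beyond the threshold are shifted toward zero by the threshold.
    """
    if not signal:
        return []
    threshold = ratio * max(abs(x) for x in signal)
    clipped = []
    for x in signal:
        if x > threshold:
            clipped.append(x - threshold)
        elif x < -threshold:
            clipped.append(x + threshold)
        else:
            clipped.append(0.0)  # low-amplitude part is eliminated
    return clipped


def cohen_kappa(actual, predicted):
    """Cohen's kappa between actual and predicted class labels."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(actual)
    # Observed agreement: fraction of samples where the labels match.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Chance agreement: product of the marginal label frequencies per class.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    if expected == 1.0:
        # Degenerate case: both sequences contain one identical class.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

As a usage example, `cohen_kappa(["happy", "happy", "sad", "sad"], ["happy", "happy", "sad", "happy"])` compares four hypothetical labels against four hypothetical predictions. Under the criterion stated above, a value greater than 0.75 would be read as good agreement between the classification results and the actual results.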