In this paper, we first present the structure of the Hierarchical Sentiment Analysis Model for Multimodal Fusion (HMAMF). The model uses Bi-LSTM networks to extract unimodal music features and a CME encoder for feature fusion. After unimodal sentiment analysis, an auxiliary loss is obtained for each modality and co-trained with the fusion task. Finally, the application of the HMAMF model to university music teaching is explored. The results show that the agreement between the dominant sentiment and the HMAMF model's predictions exceeds 80%, indicating that the model performs well in testing. After 35 training epochs, the network's recognition accuracy reached 97.19%. The mean accuracy over three recognition runs for music lengths from 50 seconds to 300 seconds ranged from 87.92% to 98.20%, with recognition accuracy decreasing slightly as music length increased. The model's judgments of the mood and beat of the music were highly consistent with the students' own delineations. Students' and teachers' satisfaction with the sentiment analysis model's performance in terms of "music tempo, rhythm, mood, content, and recognition time" ranged from 81.15% to 85.83% and from 83.25% to 92.39%, respectively. Overall satisfaction with the proposed HMAMF model was 89.43% among teachers and 90.97% among students. These results indicate that the HMAMF model is well suited to the music teaching process.
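To make the pipeline described above concrete, the following is a minimal sketch, not the authors' implementation, of a Bi-LSTM-per-modality extractor, a cross-modal fusion encoder (approximated here with a standard Transformer encoder standing in for the CME encoder), and unimodal auxiliary losses co-trained with the fused prediction. The choice of modalities (audio frames and lyric embeddings), the feature dimensions, the number of sentiment classes, and the auxiliary loss weight `aux_weight` are all assumptions for illustration only.

```python
# Hedged sketch of an HMAMF-style model: Bi-LSTM unimodal encoders,
# a Transformer-based fusion stage (stand-in for the CME encoder),
# and auxiliary unimodal sentiment losses co-trained with the fused head.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HMAMFSketch(nn.Module):
    def __init__(self, audio_dim=128, text_dim=300, hidden=128,
                 n_classes=4, n_fusion_layers=2):
        super().__init__()
        # Bi-LSTM feature extractors, one per modality
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True,
                                  bidirectional=True)
        self.text_lstm = nn.LSTM(text_dim, hidden, batch_first=True,
                                 bidirectional=True)
        # Cross-modal fusion encoder (assumed architecture, not the paper's exact CME)
        layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_fusion_layers)
        # Unimodal heads (auxiliary tasks) and the fused sentiment head
        self.audio_head = nn.Linear(2 * hidden, n_classes)
        self.text_head = nn.Linear(2 * hidden, n_classes)
        self.fused_head = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio, text):
        a, _ = self.audio_lstm(audio)   # (B, T_audio, 2*hidden)
        t, _ = self.text_lstm(text)     # (B, T_text, 2*hidden)
        # Concatenate the two sequences and fuse, then mean-pool over time
        fused = self.fusion(torch.cat([a, t], dim=1)).mean(dim=1)
        return (self.audio_head(a.mean(dim=1)),
                self.text_head(t.mean(dim=1)),
                self.fused_head(fused))


def co_training_loss(outputs, labels, aux_weight=0.3):
    # Joint objective: fused-prediction loss plus weighted unimodal auxiliary losses
    audio_logits, text_logits, fused_logits = outputs
    return (F.cross_entropy(fused_logits, labels)
            + aux_weight * (F.cross_entropy(audio_logits, labels)
                            + F.cross_entropy(text_logits, labels)))


if __name__ == "__main__":
    model = HMAMFSketch()
    audio = torch.randn(8, 200, 128)   # e.g. frame-level acoustic features
    text = torch.randn(8, 50, 300)     # e.g. lyric word embeddings
    labels = torch.randint(0, 4, (8,))
    loss = co_training_loss(model(audio, text), labels)
    loss.backward()
```

In this sketch the auxiliary unimodal losses are simply added to the fused loss with a fixed weight; the paper's co-training procedure may weight or schedule these terms differently.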