With the continuous development of literature and art, vocal music education, as an important component of aesthetic education, has advanced at an unprecedented pace. This paper constructs a music emotion recognition model based on deep learning and an LSTM network. The music signal is preprocessed through pre-emphasis, framing, and windowing to improve its purity. Mel-frequency cepstral coefficients (MFCCs) and cochlear frequency features are used to convert the analog music signal into frequency-domain features that better distinguish acoustic characteristics, and these features are combined with Word2vec to extract music emotion features. The performance of the proposed model is verified in terms of classification accuracy by comparing the LSTM music emotion recognition model with other models. To examine its practical value, the recognition model is then applied to the teaching of ethnic vocal music. The results show that the average fitness of the LSTM recognition model ranges from 75% to 90% as the number of evolutions increases, and the average fitness of the LSTM objective function peaks at 95% after 40 iterations. Under the music recognition model, the students' spectrum rises above the 0 dB reference line, with the amplitude mostly fluctuating within the interval (-15, -3), so teachers can design appropriate ethnic vocal music instruction based on the students' spectra.
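To make the preprocessing chain concrete, the following is a minimal sketch of pre-emphasis, framing, windowing, and MFCC extraction, assuming librosa and numpy are available. The pre-emphasis coefficient 0.97, frame sizes, and the file path are illustrative placeholders, not the paper's actual settings.

```python
import numpy as np
import librosa

def preprocess_and_extract(path, sr=16000, n_mfcc=13,
                           frame_length=400, hop_length=160):
    # Load the music clip as a mono waveform at a fixed sample rate.
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Pre-emphasis: boost high frequencies to flatten the spectrum
    # (0.97 is a common choice, assumed here).
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Framing with a Hamming window, then MFCCs per frame.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=frame_length, hop_length=hop_length,
        window="hamming",
    )
    return mfcc.T  # shape: (num_frames, n_mfcc)

features = preprocess_and_extract("clip.wav")  # hypothetical file
```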
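The abstract does not specify how Word2vec enters the pipeline; one plausible reading is that emotion descriptors (e.g., lyric or tag tokens) are embedded as vectors. Below is a heavily hedged gensim sketch under that assumption; the toy corpus, vector size, and window are placeholders.

```python
from gensim.models import Word2Vec

# Hypothetical tokenized emotion descriptors for two clips.
corpus = [["joyful", "bright", "lively"],
          ["sorrowful", "slow", "plaintive"]]
w2v = Word2Vec(sentences=corpus, vector_size=64, window=3,
               min_count=1, workers=1)
emotion_vec = w2v.wv["joyful"]  # 64-dim emotion feature vector
```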
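Finally, a minimal sketch of an LSTM emotion classifier over frame-level features, assuming PyTorch. The hidden size, feature dimension, and the number of emotion classes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    def __init__(self, n_features=13, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, frames, n_features)
        _, (h, _) = self.lstm(x)     # final hidden state summarizes the clip
        return self.fc(h[-1])        # logits over emotion classes

model = EmotionLSTM()
logits = model(torch.randn(8, 100, 13))  # 8 clips, 100 frames each
```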