In keeping with the demands of the times, cultivating students’ ability to master rhythm is crucial to improving the quality of music talent cultivation. This paper uses the vocal rhythm teaching method to establish an optimization model for music teaching in colleges and universities. Within this teaching model, a real-time music beat recognition method that incorporates music style is proposed, based on a recurrent neural network (RNN) and a long short-term memory (LSTM) network, and feature fusion with a mutual attention mechanism is used for multimodal music emotion recognition, so that students’ music learning outcomes can be gauged accurately. To evaluate the effectiveness of the optimization model, it is applied in teaching practice, and the data are analyzed quantitatively in terms of music recognition ability and teaching effect. The results show that the recognition rate of the multimodal music emotion recognition method across 10 emotion categories of folk songs reaches up to 0.98; that the mean final music literacy assessment scores of the experimental class increase by 2.89 to 3.38 points relative to their pre-experiment values; and that positive classroom mood shows a statistically significant difference at the 5% level. Applying the optimization model of college music teaching based on the vocal rhythm teaching method in practical music courses can cultivate students’ interest in music, help raise their aesthetic level, and promote their overall development.