With the rapid advancement of artificial intelligence (AI), its use in education has attracted considerable attention, particularly in mental health education. Contemporary students face mounting academic and social pressures, and addressing them requires techniques that can accurately recognize students’ emotional states and deliver customized learning resources. Existing approaches to mental health education often fall short because they rely heavily on educators’ experience and observation and struggle to handle complex multimodal data. This research investigates the fusion of multimodal audio-visual features with a transformer architecture for emotion recognition. In parallel, an enhanced probabilistic matrix factorization (PMF) model is developed to recommend tailored content to students. The goal is to provide a more accurate and effective approach to mental health education.