As multimedia technology and artificial intelligence advance rapidly, research on and deployment of intelligent educational technologies based on deep learning and neural networks have intensified. This study introduces an automatic student behavior recognition model for college English translation classrooms. Built on multimodal fusion, the model uses Histogram of Oriented Gradients (HOG) features to capture the facial expression modality and Mel Frequency Cepstral Coefficients (MFCCs) to capture the students’ speech (linguistic) modality. Linear Discriminant Analysis reduces the dimensionality of the multimodal data, which facilitates integrating the two modalities for feature fusion and decision-making. After the multimodal emotion recognition data are acquired, students evaluate the recognized emotion values, which allows an emotion scale to be calibrated. This scale maps specific learning emotions to corresponding learning states, providing a framework for assessing students’ engagement levels. Over four weeks, the average incidence of students classified in a “positive” learning state rose from 9.5 to 26.7. This rise indicates increasing engagement and interaction in the college English translation classroom while the multimodal fusion model was in use, and suggests that the model makes it easier for educators to manage classroom dynamics.
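As an illustration of how two modalities can be combined at the decision stage, the following is a minimal sketch of weighted late fusion. It is not the study's implementation: the weighted-average rule, the `face_weight` parameter, the function names, and the emotion labels are all assumptions introduced here for illustration. Each modality is assumed to have already produced a probability per emotion label (for example, from a classifier trained on HOG features and another trained on MFCC features).

```python
from typing import Dict


def fuse_decisions(
    face_probs: Dict[str, float],
    speech_probs: Dict[str, float],
    face_weight: float = 0.5,  # hypothetical weight, not taken from the study
) -> Dict[str, float]:
    """Combine per-label probabilities from two modalities by weighted average."""
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    if face_probs.keys() != speech_probs.keys():
        raise ValueError("both modalities must score the same set of labels")

    fused = {
        label: face_weight * face_probs[label]
        + (1.0 - face_weight) * speech_probs[label]
        for label in face_probs
    }

    # Renormalize so the fused scores sum to 1 even if the inputs did not.
    total = sum(fused.values())
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return {label: score / total for label, score in fused.items()}


def predict_emotion(fused_probs: Dict[str, float]) -> str:
    """Return the label with the highest fused probability."""
    return max(fused_probs, key=fused_probs.get)
```

With equal weights, a label that one modality favors strongly can still win when the other modality is undecided. Setting `face_weight` to 0 or 1 reduces the sketch to a single-modality decision, which gives a simple baseline for comparison.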