The outbreak of COVID-19 has brought drastic changes to English teaching as it has shifted from the offline mode before the pandemic to the online mode during the pandemic. However, in the post-pandemic era, there are still many problems in the effective implementation of the process of English teaching, leading to the inability of achieving better results in the quality and efficiency of English teaching and effective cultivation of students’ practical application ability. In recent years, English speaking has attracted the attention of experts and scholars. Therefore, this study constructs an interactive English-speaking practice scene based on a virtual character. A dual-modality emotion recognition method is proposed that mainly recognizes and analyzes facial expressions and physiological signals of students and the virtual character in each scene. Thereafter, the system adjusts the difficulty of the conversation according to the current state of students, toward making the conversation more conducive to the students’ understanding and gradually improving their English-speaking ability. The simulation compares nine facial expressions based on the eNTERFACE05 and CAS-PEAL datasets, which shows that the emotion recognition method proposed in this manuscript can effectively recognize students’ emotions in interactive English-speaking practice and reduce the recognition time to a great extent. The recognition accuracy of the nine facial expressions was close to 90% for the dual-modality emotion recognition method in the eNTERFACE05 dataset, and the recognition accuracy of the dual-modality emotion recognition method was significantly improved with an average improvement of approximately 5%.