Conventional teaching evaluation emphasizes students' mastery of knowledge while largely neglecting their affective states. Multi-modal Affective Computing (MAC) can comprehensively analyze diverse classroom information about students, including their facial expressions, gestures, and textual feedback, helping teachers detect problems with students' affective states in a timely manner and adjust teaching methods and strategies accordingly. However, existing MAC techniques may produce unstable or incorrect judgments when confronted with complex affective expressions, and the resulting inaccurate estimates of students' affective states can in turn distort the overall teaching evaluation. To address these issues, this study applied MAC to teaching evaluation through the combined processing of texts and images. The input text was divided into two parts, the main body and the hashtags, and features were extracted from each separately. Image features were extracted from two perspectives, objects and scenes, since the two perspectives capture image information at different levels. The MAC model was divided into modality-shared tasks and modality-private tasks so that it adapts better to new teaching evaluation scenarios. Experimental results verified the effectiveness of the proposed method.
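
To make the shared/private design concrete, the following is a minimal sketch, not the authors' actual model: it assumes pre-extracted feature vectors for the text body, the hashtags, and the object- and scene-level image views, encodes each with a private (modality-specific) layer, passes all of them through one shared encoder, and fuses the results for affect classification. All layer names, dimensions, and the fusion-by-averaging step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedPrivateMAC(nn.Module):
    """Hypothetical shared/private multi-modal affect classifier (illustrative only)."""
    def __init__(self, text_dim=768, img_dim=2048, hidden=256, n_classes=3):
        super().__init__()
        # Private (modality-specific) encoders
        self.body_enc = nn.Linear(text_dim, hidden)    # text main body
        self.tag_enc = nn.Linear(text_dim, hidden)     # hashtag text
        self.obj_enc = nn.Linear(img_dim, hidden)      # object-level image features
        self.scene_enc = nn.Linear(img_dim, hidden)    # scene-level image features
        # Shared encoder applied to every modality to learn common affect cues
        self.shared_enc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        # Classifier over concatenated private + shared representations
        self.classifier = nn.Linear(hidden * 5, n_classes)

    def forward(self, body, tags, obj_feat, scene_feat):
        privates = [
            torch.relu(self.body_enc(body)),
            torch.relu(self.tag_enc(tags)),
            torch.relu(self.obj_enc(obj_feat)),
            torch.relu(self.scene_enc(scene_feat)),
        ]
        # Modality-shared representation: mean of shared projections of all modalities
        shared = torch.stack([self.shared_enc(p) for p in privates]).mean(dim=0)
        fused = torch.cat(privates + [shared], dim=-1)
        return self.classifier(fused)

# Example usage with random pre-extracted features (batch of 4 students)
model = SharedPrivateMAC()
logits = model(torch.randn(4, 768), torch.randn(4, 768),
               torch.randn(4, 2048), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 3])
```

Under this assumed setup, the private encoders preserve modality-specific cues (e.g., hashtag sentiment versus scene context), while the shared encoder learns cues common to all modalities, which is the property the abstract credits for better transfer to new teaching evaluation scenarios.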