With the spread of standardized classrooms in colleges and universities, video of students’ in-class status can be collected through the cameras installed in the classroom. These abundant video sources make it easy to gather large volumes of images of students’ in-class status. How to use such unstructured video big data to improve teaching quality is a topic worth researching. First, the current teaching ability of college and university teachers is surveyed and its problems are identified. Then, the You Only Look Once (YOLO) object detection network is studied, deficiencies in its network structure are explored and optimized, and the optimized network is applied to real classroom scenarios and to student expression detection. Finally, the proposed scheme is tested. The results show that 20% and 38% of teachers in higher vocational colleges report being “dissatisfied” with their classroom teaching ability and their practical guidance ability, respectively, and that 38% of teachers want to improve this situation. The accuracy of the proposed model for student expression detection is more than 8% higher than that of the Faster Region-based Convolutional Neural Network (Faster R-CNN) and the Mask Region-based Convolutional Neural Network (Mask R-CNN), more than 4% higher than that of the YOLO v3 model, and more than 6% higher than that of the YOLO v3 Tiny model. The proposed model offers some ideas for applying deep learning technology to the improvement of teachers’ teaching ability.