In this paper, we study face recognition and emotion recognition algorithms for monitoring the emotions of preschool children. Whereas previous emotion recognition work has focused on faces alone, we propose to obtain more comprehensive information from faces, gestures, and context. Using a deep learning approach, we design a lightweight network structure that reduces the number of parameters and saves computational resources. The work contributes not only an application innovation but also algorithmic enhancements. Faces are annotated in the dataset, and a hierarchical sampling method is designed to alleviate the class imbalance present in the data. A new feature descriptor, called the "oriented gradient histogram from three orthogonal planes," is proposed to characterize facial appearance variations. A new, efficient geometric feature is also proposed to capture facial contour variations, and the role of audio cues in emotion recognition is explored. Multifeature fusion is then used to combine these different features optimally. Experimental results show that, compared with other recent methods, the proposed method is effective for video-based facial expression recognition in both laboratory-controlled and outdoor environments. Experiments on expression detection were conducted on a facial expression database, and the results, compared with those of previous studies, demonstrate the effectiveness of the proposed method.
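To make the descriptor idea concrete, the sketch below illustrates one plausible reading of an "oriented gradient histogram from three orthogonal planes": gradient-orientation histograms are computed on the XY (spatial) and XT/YT (temporal) slices of a grayscale video clip and concatenated. This is a minimal illustration, not the authors' implementation; the bin count, the averaging over slices, and the omission of full HOG cell/block normalization are assumptions made here for brevity.

```python
import numpy as np

def orientation_histogram(plane, bins=8):
    """Magnitude-weighted histogram of gradient orientations for one 2-D slice."""
    gy, gx = np.gradient(plane.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi            # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-12)          # normalize so slice size cancels out

def hog_top(volume, bins=8):
    """
    Illustrative descriptor from the three orthogonal planes of a T x H x W clip:
    XY captures spatial appearance, XT and YT capture temporal texture.
    Histograms from all slices of each plane are averaged, then concatenated.
    """
    t, h, w = volume.shape
    xy = np.mean([orientation_histogram(volume[i], bins) for i in range(t)], axis=0)
    xt = np.mean([orientation_histogram(volume[:, j, :], bins) for j in range(h)], axis=0)
    yt = np.mean([orientation_histogram(volume[:, :, k], bins) for k in range(w)], axis=0)
    return np.concatenate([xy, xt, yt])         # length = 3 * bins

# Usage: a 16-frame 64x64 clip yields a 24-dimensional descriptor (bins=8).
clip = np.random.rand(16, 64, 64)
print(hog_top(clip).shape)                      # (24,)
```

In a fusion setting such as the one described above, a descriptor like this would be concatenated with geometric and audio features before classification; the exact fusion scheme is left to the paper.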