A Facial Expression Recognition (FER) method based on Conv3D-ConvLSTM-SEnet in an online education environment is proposed to address the low accuracy of current classroom FER methods. First, ConvLSTM is used to combine the local feature extraction ability of CNNs with the temporal modeling ability of LSTMs, and on this basis the rich spatial features of the image are characterized. Then, Depthwise Separable Convolution (DSC) is introduced to change the number of output channels and filter each channel separately, and the resulting Feature Maps (F-M) are concatenated in sequence to obtain a multi-channel output F-M. Finally, by redistributing the extracted abstract features and basic texture features through the ConvLSTM and SEnet modules, a Conv3D-ConvLSTM-based FER model is constructed, achieving high-accuracy classroom FER. The proposed classroom FER method was compared with five other methods in simulation experiments on the CK+, FER2013, and JAFFE datasets. The results indicate that the proposed method achieves the highest accuracy, precision, recall, and F1 score, with improvements of at least 1.85%, 2.41%, 1.18%, and 2.05%, respectively, over the other five methods on the FER2013 dataset. The proposed method can perform FER on students in online classroom teaching, detect their emotions in real time, and help teachers adjust their course instruction accordingly.
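To make the described pipeline concrete, the following is a minimal Keras sketch of a Conv3D-ConvLSTM-SE network of the kind outlined above. It is not the authors' implementation: the clip length (16 frames), input resolution (48x48 grayscale), channel widths, SE reduction ratio, and 7 expression classes are illustrative assumptions.

```python
# A minimal sketch (not the paper's released code) of a Conv3D-ConvLSTM-SE
# style FER network. Shapes and layer sizes are assumed for illustration:
# 16-frame clips of 48x48 grayscale faces, 7 expression classes.
import tensorflow as tf
from tensorflow.keras import layers, models

def se_block(x, ratio=8):
    """Squeeze-and-Excitation: reweight feature-map channels by learned importance."""
    c = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)            # squeeze: global spatial context
    s = layers.Dense(c // ratio, activation="relu")(s)
    s = layers.Dense(c, activation="sigmoid")(s)      # excitation: per-channel gates
    return layers.Multiply()([x, layers.Reshape((1, 1, c))(s)])

def build_model(frames=16, h=48, w=48, n_classes=7):
    inp = layers.Input((frames, h, w, 1))
    # Conv3D stage: joint spatio-temporal feature extraction over the clip
    x = layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu")(inp)
    x = layers.MaxPooling3D((1, 2, 2))(x)
    # ConvLSTM stage: temporal modeling while preserving spatial structure
    x = layers.ConvLSTM2D(64, (3, 3), padding="same", return_sequences=False)(x)
    # Depthwise separable convolution: per-channel filtering, then 1x1 channel mixing
    x = layers.SeparableConv2D(128, (3, 3), padding="same", activation="relu")(x)
    # SE block: redistribute weight between abstract and basic texture feature channels
    x = se_block(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)

model = build_model()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

A real training run on CK+, FER2013, or JAFFE would additionally require face detection, alignment, and clip assembly, which this sketch omits.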