Summary

Convolutional neural networks (CNNs) have become the mainstream method for depression recognition from facial video, but their performance still has room for improvement because they extract global and local information insufficiently and neglect the correlation between temporal and spatial information. This paper proposes a novel dual‐task enhanced global–local temporal–spatial network (DTE‐GLTS) that strengthens the extraction of global and local features and deepens the analysis of temporal–spatial correlation. We design a dual‐task learning mode that uses the data‐efficient image transformer (DeiT) as the main body to learn the global features of video sequences, while a pre‐trained temporal–spatial fusion (TSF) network guides DeiT to learn local features. In addition, we propose the TSF mechanism, which fuses temporal–spatial information in video sequences more effectively and strengthens the correlation between frames and pixels; embedding it in ResNet forms the TSF network. To the best of our knowledge, this is the first application of DeiT and a dual‐task learning mode to facial video depression recognition. Experimental results on AVEC 2013 and AVEC 2014 show that our method achieves competitive performance, with mean absolute error/root mean square error (MAE/RMSE) scores of 6.06/7.73 and 5.91/7.68, respectively, while significantly reducing the number of parameters.
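
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MAE and RMSE are computed from predicted and ground-truth depression scores. The function names and the sample values are illustrative assumptions; they are not taken from the paper or from the AVEC datasets.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |label - prediction|."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared difference."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)


# Hypothetical per-video scores (assumed values, not real AVEC data).
labels = [10.0, 25.0, 3.0, 40.0]
predictions = [12.0, 20.0, 6.0, 34.0]

print("MAE :", mae(labels, predictions))
print("RMSE:", rmse(labels, predictions))
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE and is never smaller than MAE on the same predictions, which is consistent with the RMSE values in the reported score pairs being the larger ones.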