In human behavior detection, video classification based on deep learning has become a prevalent technique. However, existing models capture behavior characteristics inadequately, which limits the accuracy of their recognition results. To address this issue, this paper proposes a new model that improves upon the existing PPTSM model. Specifically, our model employs a multi-scale dilated attention mechanism, which integrates multi-scale semantic information and captures the characteristic information of abnormal human behavior more effectively. Additionally, to enrich the feature information of human behavior, we propose a gradient flow feature information fusion module that combines high-level semantic features with low-level detail features, enabling the network to extract more comprehensive features. Experiments on an elevator passenger dataset containing four abnormal behaviors (door picking, jumping, kicking, and door blocking) show that the top-1 accuracy of our model reaches 95%, a 10% improvement over the PPTSM model. Moreover, experiments on four publicly available datasets (UCF24, UCF101, HMDB51, and Something-Something V1) demonstrate that our method outperforms PPTSM by 6.8%, 6.1%, 21.2%, and 3.96%, respectively.
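To make the multi-scale dilated attention idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it applies parallel depthwise convolutions with different dilation rates to a frame-level feature map, fuses the responses, and uses the result as a spatial attention map. All layer choices, kernel sizes, and dilation rates here are illustrative assumptions.

```python
# Minimal sketch of a multi-scale dilated attention block (illustrative only;
# kernel sizes, dilation rates, and layer names are assumptions, not the
# paper's exact implementation).
import torch
import torch.nn as nn

class MultiScaleDilatedAttention(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        # One depthwise 3x3 branch per dilation rate, gathering context
        # over several receptive-field sizes.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, groups=channels)
            for d in dilations
        ])
        # 1x1 conv fuses the multi-scale responses into an attention map.
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):  # x: (N, C, H, W) frame-level features
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        attn = torch.sigmoid(self.fuse(multi_scale))  # per-pixel attention weights
        return x * attn  # reweight the input features by multi-scale attention


if __name__ == "__main__":
    feats = torch.randn(2, 64, 56, 56)            # dummy frame-level features
    out = MultiScaleDilatedAttention(64)(feats)
    print(out.shape)                              # torch.Size([2, 64, 56, 56])
```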