Video anomaly detection (VAD) plays a crucial role in fields such as security, production, and transportation. To address the issue of overgeneralization in anomaly behavior prediction by deep neural networks, we propose a network called AMFCFBMem-Net (appearance and motion feature cross-fusion block memory network), which combines appearance and motion feature cross-fusion blocks. Firstly, dual encoders for appearance and motion are employed to separately extract these features, which are then integrated into the skip connection layer to mitigate the model’s tendency to predict abnormal behavior, ultimately enhancing the prediction accuracy for abnormal samples. Secondly, a motion foreground extraction module is integrated into the network to generate a foreground mask map based on speed differences, thereby widening the prediction error margin between normal and abnormal behaviors. To capture the latent features of various models for normal samples, a memory module is introduced at the bottleneck of the encoder and decoder structures. This further enhances the model’s anomaly detection capabilities and diminishes its predictive generalization towards abnormal samples. The experimental results on the UCSD Pedestrian dataset 2 (UCSD Ped2) and CUHK Avenue anomaly detection dataset (CUHK Avenue) demonstrate that, compared to current cutting-edge video anomaly detection algorithms, our proposed method achieves frame-level AUCs of 97.5% and 88.8%, respectively, effectively enhancing anomaly detection capabilities.