The coexistence of human populations and wildlife in shared habitats calls for effective intrusion detection systems that mitigate potential conflicts. Detecting the intrusion of wild animals, especially in areas where human-wildlife conflicts are common, is essential for both human and animal safety. Animal intrusion has become a serious threat to crop yield, affecting food security and reducing farmer profits, and rural residents and forestry workers are increasingly concerned about animal attacks. Drones and surveillance cameras are frequently used to monitor the movements of wild animals, but an effective model is still needed to identify the type of animal, track its movement, and report its position. This paper presents a novel methodology for detecting the intrusion of wild animals using deep neural networks with multi-shift spatio-temporal features extracted from surveillance camera video. The proposed method consists of a multi-shift attention convolutional neural network model to extract spatial features, a multi-moment gated recurrent unit attention model to extract temporal features, and a feature fusion network that combines the spatial semantics and temporal features of the surveillance video frames. The proposed model was tested with images from three different datasets and achieved promising results in terms of mean accuracy and precision.