In crowded environments, the importance of automatic video surveillance cannot be overstated: it plays a vital role in detecting unusual incidents and averting accidents, particularly in areas teeming with pedestrians, and surveillance frameworks prove their worth when deployed in real-world scenarios. This paper proposes an efficient algorithm for identifying abnormalities in videos that addresses two problems: the high computational cost of existing methods and the imbalance between positive and negative samples. To compensate for the scarcity of negative samples, the algorithm employs a spatiotemporal inter-fused autoencoder in an unsupervised manner to locate and extract negative samples from the dataset. On this basis, a spatiotemporal convolutional neural network (CNN) with a simple structure and low computational requirements is built, and it is trained in a supervised manner on both positive and negative samples to obtain the detection model. The UMN and UCSD datasets serve as benchmarks for this method. Experimental results show that the proposed method is more accurate than existing algorithms at both the frame and pixel levels and can locate anomalous behaviors in real time. It achieves accuracies of 99.76%, 99.92%, 99.15%, 98.45%, and 94.67% on the UCSD Ped1, UCSD Ped2, and UMN scene 1, 2, and 3 datasets, respectively, outperforming hybrid CNN and RF classifiers, MCMS-BCN attention with DenseNet121/EfficientNetV2, the gradient motion descriptor (PGD) with an enhanced entropy classifier, and an attention mechanism. It likewise achieves AUCs of 99.96%, 99.83%, 99.97%, 90.15%, and 99.72% on the same datasets, outperforming low-rank and compact coefficient dictionary learning (LRCCDL) and hybrid CNN and RF classifiers.
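
To make the two-stage pipeline concrete, the following is a minimal sketch of the approach the abstract describes: an unsupervised spatiotemporal autoencoder whose reconstruction error is used to mine negative (anomalous) samples, followed by a lightweight supervised spatiotemporal CNN. All module names, layer sizes, clip shapes, and the error threshold are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage pipeline; shapes and the mining
# threshold are assumptions, not values from the paper.
import torch
import torch.nn as nn

class SpatioTemporalAutoencoder(nn.Module):
    """Stage 1: unsupervised 3D autoencoder over short clips (N, C, T, H, W).
    Clips with high reconstruction error are treated as candidate negatives."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

class SpatioTemporalCNN(nn.Module):
    """Stage 2: simple, low-cost supervised classifier (normal vs. anomalous)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, 2)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def mine_negatives(autoencoder, clips, threshold):
    """Flag clips whose per-clip mean reconstruction error exceeds `threshold`
    as negative samples (hypothetical mining rule)."""
    with torch.no_grad():
        errors = ((autoencoder(clips) - clips) ** 2).mean(dim=(1, 2, 3, 4))
    return clips[errors > threshold]

if __name__ == "__main__":
    clips = torch.rand(8, 1, 8, 64, 64)          # 8 grayscale 8-frame clips
    ae = SpatioTemporalAutoencoder()
    negatives = mine_negatives(ae, clips, 0.05)  # 0.05 is an assumed threshold
    cnn = SpatioTemporalCNN()
    logits = cnn(clips)                          # (8, 2) normal/anomaly scores
    print(negatives.shape, logits.shape)
```

In this reading, the autoencoder is trained only on unlabeled footage, and the mined negatives are combined with positive (normal) samples to train the classifier, which keeps the supervised model small and the runtime cost low, consistent with the real-time claim above.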