Anomaly event detection in crowd scenes is extremely important; however, most existing studies rely solely on hand-crafted features to detect anomalies. In this study, a novel unsupervised deep learning framework is proposed to detect anomalous events in crowded scenes. Specifically, low-level visual features, energy features, and motion map features are extracted simultaneously based on spatiotemporal energy measurements. Three convolutional restricted Boltzmann machines are trained to model the mid-level feature representation of normal patterns, and a multimodal fusion scheme is then used to learn a deep representation of crowd patterns. Based on this learned deep representation, a one-class support vector machine is used to detect anomalous events. The proposed method is evaluated on two publicly available datasets and compared with state-of-the-art methods. The experimental results show its competitive performance for anomaly event detection in video surveillance.
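The final detection step described above can be sketched with scikit-learn's `OneClassSVM`. This is a minimal illustration only: the random vectors below are stand-ins for the fused deep representations (which in the paper come from the multimodal conv-RBM fusion), and the hyperparameters are assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-in for the fused deep representation of *normal* crowd patterns;
# the model is fit on normal data only (unsupervised anomaly detection).
normal_features = rng.normal(loc=0.0, scale=1.0, size=(200, 64))

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(normal_features)

# Score new clips: +1 = normal, -1 = anomalous.
test_features = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 64)),   # in-distribution samples
    rng.normal(5.0, 1.0, size=(5, 64)),   # strongly shifted samples
])
labels = ocsvm.predict(test_features)
print(labels)
```

The `nu` parameter bounds the fraction of training points treated as outliers, which is how the one-class formulation tolerates noise in the "normal" training set.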
The major challenges in density map estimation and accurate crowd counting stem from large-scale variations, severe occlusions, and perspective distortions. Existing methods generally suffer from blurred density maps, caused by averaging convolution kernels, and from ineffective estimation across different crowd scenes. In this paper, we propose a multi-scale fusion conditional generative adversarial network (MFC-GAN) that generates high-resolution, high-quality density maps. The fusion module of MFC-GAN is embedded in a multi-scale generator and discriminator architecture with a novel adversarial loss designed to guide high-resolution density map generation. To address the problem of scale variation, we further propose a bidirectional fusion module that combines deep global semantic features with shallow local information by leveraging feature maps from different layers of the generator. Furthermore, to increase the effectiveness of the multi-scale fusion, we design a cross-attention fusion module that weights the multi-scale fused features and learns context-aware feature maps for generating high-quality density maps. Experiments on four challenging datasets show the effectiveness, feasibility, and robustness of the proposed MFC-GAN.
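The fusion idea above (combining shallow high-resolution features with upsampled deep semantic features, then weighting the scales) can be sketched in NumPy. This is a toy sketch under stated assumptions, not the MFC-GAN architecture: the softmax-over-pooled-descriptors weighting is a simplified stand-in for the paper's cross-attention fusion module, and the feature maps are random placeholders.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def cross_attention_fuse(shallow, deep):
    """Weight two spatially aligned (C, H, W) feature maps by a softmax
    over their global-average-pooled descriptors (simplified attention)."""
    stack = np.stack([shallow, deep])             # (2, C, H, W)
    desc = stack.mean(axis=(2, 3))                # (2, C) pooled descriptors
    scores = desc.sum(axis=1)                     # one scalar per scale
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over the 2 scales
    return np.tensordot(weights, stack, axes=1)   # (C, H, W) weighted sum

rng = np.random.default_rng(0)
shallow = rng.normal(size=(8, 16, 16))   # shallow layer: local detail
deep = rng.normal(size=(8, 4, 4))        # deep layer: global semantics
fused = cross_attention_fuse(shallow, upsample_nearest(deep, 4))
print(fused.shape)  # (8, 16, 16)
```

The point of the weighting is that neither scale is averaged blindly: the fused map adapts to whichever scale carries the stronger response, which is what lets such modules counter the blurring of fixed averaging kernels.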
Video event detection is a challenging problem in many applications, such as video surveillance and video content analysis. In this paper, we propose a new framework that derives high-level codewords by analyzing the temporal relationships between different channels of video features. Low-level vocabulary words are first generated from audio and visual feature extraction. A weighted undirected graph is then constructed by exploring the Granger causality between the low-level words, and a greedy agglomerative graph-partitioning method is used to discover groups of low-level words that share similar temporal patterns. The high-level codebook representation is obtained by quantizing these low-level word groups. Finally, multiple kernel learning, combined with our high-level codewords, is used to detect video events. Extensive experimental results show that the proposed method achieves favorable results in video event detection.
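The graph-construction and greedy-agglomeration steps above can be sketched as follows. Two caveats: the lagged-correlation score below is a simplified proxy for a proper Granger-causality test (which compares nested autoregressive models), and the merge threshold is an assumed illustrative value, not from the paper.

```python
import numpy as np

def lagged_influence(x, y, lag=1):
    """Proxy Granger score: |correlation| between x delayed by `lag` and y.
    A full Granger test would compare nested AR models; this is a sketch."""
    return abs(np.corrcoef(x[:-lag], y[lag:])[0, 1])

def build_graph(series):
    """Weighted undirected graph over low-level words (time series)."""
    n = len(series)
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Undirected weight: strongest influence in either direction.
            w[i, j] = w[j, i] = max(lagged_influence(series[i], series[j]),
                                    lagged_influence(series[j], series[i]))
    return w

def greedy_agglomerate(w, threshold=0.5):
    """Greedily merge the two groups joined by the heaviest edge until
    no remaining inter-group weight exceeds `threshold`."""
    groups = [{i} for i in range(len(w))]
    while len(groups) > 1:
        best, pair = -1.0, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                weight = max(w[i, j] for i in groups[a] for j in groups[b])
                if weight > best:
                    best, pair = weight, (a, b)
        if best < threshold:
            break
        a, b = pair
        groups[a] |= groups.pop(b)
    return groups

rng = np.random.default_rng(0)
t = np.arange(400)
base = np.sin(t / 10.0)
# Words 0 and 1 share a lagged temporal pattern; word 2 is independent noise.
series = [base + 0.1 * rng.normal(size=400),
          np.roll(base, 1) + 0.1 * rng.normal(size=400),
          rng.normal(size=400)]
groups = greedy_agglomerate(build_graph(series))
print(groups)  # words 0 and 1 merge; word 2 stays on its own
```

Each resulting group then becomes one high-level codeword, so the codebook captures which feature channels tend to co-occur with a consistent temporal lead/lag structure.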