Compared with single-image crowd counting, video provides spatial-temporal information about the crowd that can improve the robustness of counting. However, translation, rotation, and scaling of people change the density map of heads between neighbouring frames, while people walking in or out of the scene, or becoming occluded, change the head counts. To alleviate these issues in video crowd counting, we propose a Locality-constrained Spatial Transformer Network (LSTN). Specifically, we first leverage a convolutional neural network to estimate the density map for each frame. Then, to relate the density maps of neighbouring frames, a Locality-constrained Spatial Transformer (LST) module is introduced to estimate the density map of the next frame from that of the current frame. To facilitate performance evaluation, we collect a large-scale video crowd counting dataset containing 15K frames with about 394K annotated heads captured from 13 different scenes; to the best of our knowledge, it is the largest video crowd counting dataset to date. Extensive experiments on our dataset and other crowd counting datasets validate the effectiveness of LSTN for crowd counting. Our dataset is released at https://github.com/sweetyy83/Lstn_fdst_dataset.
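The LST idea can be illustrated with a short sketch: the current frame's density map is divided into local blocks, and each block is warped by its own predicted affine transform to estimate the next frame's density map. Below is a minimal PyTorch sketch; the LocalSpatialTransformer class, its regressor widths, and the block size are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSpatialTransformer(nn.Module):
    def __init__(self, block_size=16):
        super().__init__()
        self.block_size = block_size
        # Small regressor predicting 6 affine parameters per local block
        # (widths are illustrative, not the paper's exact layers).
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 6),
        )
        # Initialize to the identity transform so training starts stable.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, density):                      # density: (B, 1, H, W)
        b = self.block_size                          # assumes H, W divisible by b
        _, _, H, W = density.shape
        out = torch.zeros_like(density)
        for i in range(0, H, b):
            for j in range(0, W, b):
                patch = density[:, :, i:i + b, j:j + b]
                theta = self.loc(patch).view(-1, 2, 3)
                grid = F.affine_grid(theta, patch.shape, align_corners=False)
                # Warp each block independently: the locality constraint.
                out[:, :, i:i + b, j:j + b] = F.grid_sample(
                    patch, grid, align_corners=False)
        return out  # estimated density map of the next frame

if __name__ == "__main__":
    lst = LocalSpatialTransformer()
    cur = torch.rand(2, 1, 64, 64)   # stand-in current-frame density maps
    print(lst(cur).shape)            # torch.Size([2, 1, 64, 64])

Warping per block rather than globally lets each local region of the crowd follow its own motion, which is why the transformer is locality-constrained.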
Accurate detection of multi-class instance objects in remote sensing images (RSIs) is a fundamental but challenging task in aviation and satellite image processing, and it plays a crucial role in a wide range of practical applications. Compared with object detection in natural images, RSI-based object detection still faces two main challenges: 1) instance objects vary greatly in size and are densely arranged in the input images; 2) complex background distributions around instance objects tend to blur object boundaries, making it difficult to distinguish objects from the background and introducing undesired interference into feature learning. In this paper, to address these challenges, we propose a novel anchor-free RSI object detection framework that consists of two key components: a cross-channel feature pyramid network (CFPN) and multiple foreground-attentive detection heads (FDHs). First, an anchor-free baseline detector with the CFPN structure is developed to extract features from different convolutional layers and fuse these multi-scale features through parameterized cross-channel learning, capturing the semantic relations across scales and levels. Next, each FDH predicts an attention map to enhance the features of the foreground regions in RSIs. Furthermore, on top of this scale-aware anchor-free baseline detector, we design a curriculum-style optimization objective that dynamically reweights training instances during each training epoch, so that the detector receives relatively easy instances that match its current ability. Experimental results on three publicly available object detection datasets demonstrate that the proposed method outperforms existing object detection methods.
Index Terms: Remote sensing images, anchor-free object detection, feature pyramid structure, foreground attention, curriculum learning.
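As a rough illustration of the curriculum-style reweighting described above, the sketch below down-weights training instances whose loss exceeds a pacing threshold that relaxes as epochs progress, so the detector mostly sees instances matching its current ability. The curriculum_weights helper, the linear pacing schedule, and the sigmoid weighting rule are assumptions for illustration, not the paper's exact objective.

import torch

def curriculum_weights(losses, epoch, total_epochs, sharpness=5.0):
    """Per-instance weights in [0, 1] from detached per-instance losses."""
    losses = losses.detach()
    # Pacing threshold: starts near the easiest losses, grows to cover all
    # instances by the final epoch (an assumed linear schedule).
    progress = (epoch + 1) / total_epochs
    lo, hi = losses.min(), losses.max()
    threshold = lo + progress * (hi - lo)
    # Soft step: weight close to 1 below the threshold, decaying above it.
    return torch.sigmoid(sharpness * (threshold - losses))

# Usage inside a training step, with stand-in per-instance detection losses:
per_instance_loss = torch.rand(8)
w = curriculum_weights(per_instance_loss, epoch=0, total_epochs=12)
total_loss = (w * per_instance_loss).sum() / w.sum().clamp(min=1e-8)
print(total_loss.item())

Detaching the losses before computing weights keeps the reweighting from feeding gradients back through the difficulty estimate; only the weighted loss itself is optimized.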