Crowd counting has become an important vision task because of its many practical applications, but it remains challenging. State-of-the-art methods generally estimate the density map of a crowd image from the high-level semantic features of deep convolutional networks. However, the absence of low-level spatial information may cause counting errors in the local details of the density map. To address this, a novel framework named Multi-level Feature Fusion Network (MFFN) is proposed for single-image crowd counting. MFFN is constructed in an encoder–decoder fashion and incorporates both semantic and spatial information to generate high-resolution density maps of input crowd images. Skip connections link the encoder and the decoder so that low-level spatial information and high-level semantic features can be combined by element-wise addition. In addition, a dense dilated convolution block placed after the encoder extracts multi-scale context features, which guide feature fusion through a channel attention mechanism. The model is trained by multi-task learning, with semantic segmentation supervision introduced to enhance feature representation. Extensive experiments on three crowd counting datasets (ShanghaiTech, UCF_CC_50, UCF-QNRF) show that MFFN outperforms state-of-the-art methods. Ablation studies are also performed to verify the effectiveness of each component of the proposed method.
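The fusion step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give layer details, so the function names (`channel_attention`, `upsample_nearest`, `fuse`), the tensor shapes, the nearest-neighbour upsampling, and the parameter-free attention (global average pooling followed by a sigmoid, with no learned layers) are all assumptions made for illustration. The sketch only shows how per-channel weights derived from context features might gate low-level features before they are added element-wise to upsampled high-level features.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(context):
    """Turn a (C, H, W) context tensor into C weights in (0, 1).

    Assumption: global average pooling followed by a sigmoid.
    A trained model would normally insert learned layers here.
    """
    pooled = context.mean(axis=(1, 2))  # shape (C,)
    return sigmoid(pooled)


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) tensor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse(low, high, context):
    """Gate low-level features per channel, then add upsampled high-level ones."""
    weights = channel_attention(context)          # (C,)
    gated_low = low * weights[:, None, None]      # broadcast over H and W
    factor = low.shape[1] // high.shape[1]        # assumes an integer ratio
    return gated_low + upsample_nearest(high, factor)


# Hypothetical shapes: 4 channels, encoder output at half the resolution.
rng = np.random.default_rng(0)
low = rng.standard_normal((4, 8, 8))      # low-level spatial features
high = rng.standard_normal((4, 4, 4))     # high-level semantic features
context = rng.standard_normal((4, 4, 4))  # multi-scale context features

fused = fuse(low, high, context)

# A density map is read out as a single non-negative channel; the
# predicted count is the sum over all of its pixels.
density = np.maximum(fused.sum(axis=0), 0.0)  # stand-in for a learned head
count = density.sum()
```

Because the inputs here are random, the resulting `count` carries no meaning; the point is the data flow, in which the context features decide how much each low-level channel contributes to the fused map.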