Benefiting from the powerful feature representation ability of deep learning, Convolutional Neural Networks (CNNs) provide a better way to accurately estimate the number of people in a crowded scene, but many problems remain to be solved. Reducing network complexity and improving real-time performance, alongside improving the accuracy of crowd counting, is one of the key difficulties in the field. First, this paper introduces the research background and applications of crowd counting. Second, it reviews the commonly used counting models, loss functions, datasets, and evaluation methods. It then compares the performance, structure, advantages, and disadvantages of different algorithms on several public datasets. Finally, it summarizes the shortcomings of existing crowd counting methods and proposes directions for future research.
Metallic surface defect detection is critical to ensuring the quality of industrial products. Recently, advanced surface defect detection algorithms have been proposed. Most of these algorithms rely on convolutional neural networks (CNNs) and an anchoring scheme. However, a convolution unit samples the input feature maps only at fixed shapes and locations. Similarly, a set of anchors is uniformly predefined with fixed scales and shapes, which makes bounding box regression more difficult. Therefore, we propose an adaptive convolution and anchor network for metallic surface defect detection, named ACA-Net. Specifically, an adaptive convolution and anchor (ACA) module is proposed, which mainly consists of an adaptive convolution and an adaptive anchor. First, an adaptive convolution module (ACM) is designed, which adaptively determines the location and shape of each convolution unit. In addition, a multi-scale feature adaptive fusion (MFAF) is proposed, which is used in ACM to extract and integrate multi-scale features. Then, an adaptive anchor module (AAM) is proposed to yield more suitable anchor boxes by adaptively adjusting their shapes. Extensive experiments on the NEU-DET and GC10 datasets validate the performance of the proposed approach. On the NEU-DET dataset, ACA-Net achieves 1.8% higher Average Precision (AP) than GA-RetinaNet. Furthermore, the proposed ACA module is also adopted in GA-Faster R-CNN, improving the AP by 1.2% on the NEU-DET dataset.
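To make the limitation of fixed anchors concrete, the following minimal sketch tiles uniformly predefined anchors over a feature map, as conventional anchor-based detectors commonly do. It illustrates the baseline scheme that the abstract criticizes, not ACA-Net's adaptive anchor module. The function name `generate_anchors`, the `(cx, cy, w, h)` box layout, and the stride, scale, and ratio values are all assumptions chosen for illustration.

```python
import math

def generate_anchors(feature_size, stride, scales, ratios):
    """Tile fixed anchors (cx, cy, w, h) over every feature-map cell.

    Hypothetical helper: every location receives the same scale/ratio set,
    regardless of the image content at that location.
    """
    rows, cols = feature_size
    anchors = []
    for row in range(rows):
        for col in range(cols):
            # Anchor centre in input-image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio is height / width
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors

# Assumed toy configuration: a 2x2 feature map with stride 8.
anchors = generate_anchors((2, 2), stride=8, scales=(32,), ratios=(0.5, 1.0, 2.0))
```

Because the shapes are fixed in advance, an elongated defect that matches none of the predefined ratios leaves a large residual for the box regressor to correct, which is the difficulty an adaptive anchor scheme aims to reduce.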
Current crowd localization methods that rely on pseudo bounding boxes or pre-designed localization maps require complex pre-processing and post-processing to obtain head positions. To address this problem, this work proposes an end-to-end crowd localization framework named WSITrans, which reformulates weakly-supervised crowd localization based on the Transformer and also performs crowd counting. Specifically, we first apply global max pooling (GMP) after each stage of a pure Transformer, which extracts and retains more detail of heads. In addition, we design a binarization module that binarizes the output features of the decoder and fuses them with the confidence score to obtain a more accurate confidence score. Finally, extensive experiments demonstrate that the proposed method achieves significant improvement on three challenging benchmarks. Notably, WSITrans improves the F1-measure by 4.0%.
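The two operations named above can be sketched in a few lines. This is only an illustration under stated assumptions, not the WSITrans implementation: the token layout (a list of per-token channel vectors), the function names `global_max_pool` and `binarize`, and the threshold of 0.5 are hypothetical, and the fusion with the confidence score is not shown because the abstract does not specify it.

```python
def global_max_pool(tokens):
    """Reduce N token vectors of C channels to one vector of C channel maxima."""
    # zip(*tokens) regroups the values channel by channel.
    return [max(channel) for channel in zip(*tokens)]

def binarize(feature_map, threshold=0.5):
    """Map each value to 1 if it reaches the (assumed) threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in feature_map]

# Three tokens with two channels each (toy values).
stage_tokens = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
pooled = global_max_pool(stage_tokens)

# A 2x2 decoder output (toy values).
mask = binarize([[0.2, 0.7], [0.5, 0.1]])
```

Max pooling keeps the strongest response per channel, which is one plausible reason it preserves small, sharp head responses that averaging would dilute.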
Due to extreme scale variations in highly congested scenes, the accuracy of CNN-based crowd counting approaches still has considerable room for improvement. In this paper, we propose a new multi-scale feature adaptive integrated network (MSFAINet) for crowd counting that combines multi-scale features, hybrid attention, and dilated convolution. First, MSFAINet extracts feature maps from different levels with an improved VGG16 and focuses on the more important information that represents features at different scales. Second, it adopts a hybrid attention mechanism to enlarge the receptive field while reducing the loss of feature information caused by channel competition, and then passes these features into dilated convolutions combined with traditional convolutions. Finally, it generates the density estimation map while accelerating the convergence of the network. Extensive experiments on several mainstream datasets demonstrate the effectiveness of the approach. The experimental results show that MSFAINet can extract and retain more detailed information and greatly reduce the influence of scale variations in crowd counting.
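As a rough illustration of why dilated convolution helps with scale variation, the sketch below implements a plain 2D dilated convolution (computed as cross-correlation, as deep learning frameworks typically do) with no padding and stride 1. It is a generic example, not MSFAINet's layer; the function name `dilated_conv2d` and the toy input are assumptions.

```python
def dilated_conv2d(image, kernel, dilation=1):
    """Valid (unpadded) 2D dilated convolution over nested lists.

    A k x k kernel with dilation d covers d * (k - 1) + 1 pixels per side,
    so the receptive field grows without adding any weights.
    """
    k = len(kernel)
    span = dilation * (k - 1) + 1  # effective kernel extent
    out_h = len(image) - span + 1
    out_w = len(image[0]) - span + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    acc += kernel[u][v] * image[i + u * dilation][j + v * dilation]
            row.append(acc)
        output.append(row)
    return output

# Toy 5x5 input whose pixel value is its flat index.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

standard = dilated_conv2d(image, ones, dilation=1)  # each output sees 3x3 pixels
dilated = dilated_conv2d(image, ones, dilation=2)   # each output sees 5x5 pixels
```

With the same nine weights, the dilated kernel spans the whole 5x5 input, which is the property that lets a network respond to larger heads without extra parameters or downsampling.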