Background. In dense crowd images, traditional detection models often count multiscale targets inaccurately and suffer from a low recall rate. Methods. To address these two problems, this paper proposes an MLP-CNN model. Combined with a feature pyramid network (FPN), the model fuses feature maps carrying low-resolution and high-resolution semantic information with less computation, which effectively addresses inaccurate head counts for people at multiple scales. The MLP-CNN “mid-term” fusion model effectively fuses the features of the RGB head image and the RGB-Mask image. Using head RGB-Mask annotations and adaptive Gaussian kernel regression, the model generates an enhanced density map, which effectively addresses the low recall of head detection. Results. The MLP-CNN model was evaluated on the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets. The test results show that the proposed method significantly reduces the counting error, and the recall rate can reach 79.91%. Conclusion. The MLP-CNN model improves both the accuracy of crowd counting by density map regression and the detection rate of multiscale head targets.
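To make the density map idea concrete, the following is a minimal sketch of how a density map with adaptive Gaussian kernels might be built from head annotations. It is not the paper's implementation: it assumes the geometry-adaptive scheme commonly used in crowd counting, where each head's kernel width is proportional to the mean distance to its nearest annotated neighbours. The function names, the parameters `k`, `beta`, `default_sigma`, and `min_sigma`, and their values are illustrative assumptions, and the RGB-Mask enhancement described above is not modelled.

```python
import math


def adaptive_sigma(points, i, k=3, beta=0.3, default_sigma=4.0, min_sigma=1.0):
    """Kernel width for head i (hypothetical helper, assumed scheme).

    sigma = beta * mean distance to the k nearest other heads, so heads in
    dense regions get narrow kernels and isolated heads get wide ones.
    """
    x, y = points[i]
    dists = sorted(
        math.hypot(x - ox, y - oy)
        for j, (ox, oy) in enumerate(points)
        if j != i
    )
    nearest = dists[:k]
    if not nearest:
        # Only one annotated head: fall back to a fixed width.
        return default_sigma
    # Floor the width so coincident annotations do not give a zero-width kernel.
    return max(beta * sum(nearest) / len(nearest), min_sigma)


def adaptive_density_map(points, height, width, **sigma_kwargs):
    """Build a height x width density map from (x, y) head positions.

    Each head contributes a Gaussian normalised to sum to 1 over the image,
    so the sum of the whole map equals the number of annotated heads.
    """
    density = [[0.0] * width for _ in range(height)]
    for i, (x, y) in enumerate(points):
        sigma = adaptive_sigma(points, i, **sigma_kwargs)
        two_var = 2.0 * sigma * sigma
        weights = [
            [math.exp(-((c - x) ** 2 + (r - y) ** 2) / two_var) for c in range(width)]
            for r in range(height)
        ]
        total = sum(sum(row) for row in weights)
        if total == 0.0:
            # Head lies too far outside the image to contribute.
            continue
        for r in range(height):
            for c in range(width):
                density[r][c] += weights[r][c] / total
    return density


# Toy example: four annotated heads in a 40 x 48 image.
heads = [(10, 10), (12, 11), (30, 25), (31, 27)]
dmap = adaptive_density_map(heads, height=40, width=48)
estimated_count = sum(sum(row) for row in dmap)
```

Because every kernel is normalised over the image, `estimated_count` recovers the number of annotated heads up to floating-point error, which is the property that lets a regression network trained on such maps output a count by summing its prediction.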