Aiming at the problem that the existing crowd counting methods cannot achieve accurate crowd counting and map visualization in a large scene, a crowd density estimation and mapping method based on surveillance video and GIS (CDEM-M) is proposed. Firstly, a crowd semantic segmentation model (CSSM) and a crowd denoising model (CDM) suitable for high-altitude scenarios are constructed by transfer learning. Then, based on the homography matrix between the video and remote sensing image, the crowd areas in the video are projected to the map space. Finally, according to the distance from the crowd target to the camera, the camera inclination, and the area of the crowd polygon in the geographic space, a BP neural network for the crowd density estimation is constructed. The results show the following: (1) The test accuracy of the CSSM was 96.70%, and the classification accuracy of the CDM was 86.29%, which can achieve a high-precision crowd extraction in large scenes. (2) The BP neural network for the crowd density estimation was constructed, with an average error of 1.2 and a mean square error of 4.5. Compared to the density map method, the MAE and RMSE of the CDEM-M are reduced by 89.9 and 85.1, respectively, which is more suitable for a high-altitude camera. (3) The crowd polygons were filled with the corresponding number of points, and the symbol was a human icon. The crowd mapping and visual expression were realized. The CDEM-M can be used for crowd supervision in stations, shopping malls, and sports venues.