Surveillance systems analyze the image itself, mainly from the perspective of computer vision, and lack integration with geographic information. As a result, it is difficult to obtain the location, size, and other spatial attributes of moving objects, because such systems cannot be coupled with the geographical environment. To overcome these limitations, we propose a general framework for fusing 3D geographic information with moving objects in surveillance video; the framework extracts objects' spatial–temporal information and visualizes object trajectories in a 3D model. It does not depend on specific algorithms for determining the camera model, extracting objects, or mapping between image and scene. In our experiment, we used Zhang's calibration method and the EPnP method to determine the camera model, YOLOv5 and Deep SORT to extract objects from the video, and the intersection of imaging rays with a digital surface model (DSM) to locate objects in the 3D geographical scene. The experimental results show that when the bounding box thoroughly outlines the entire object, the maximum error and root mean square error of the planar position are within 31 cm and 10 cm, respectively, and within 10 cm and 3 cm, respectively, in elevation. The errors of the average width and height of moving objects are within 5 cm and 2 cm, respectively, consistent with the objects' true dimensions. To our knowledge, this is the first general fusion framework of its kind. This paper offers a solution for integrating 3D geographic information and surveillance video that not only provides a spatial perspective for intelligent video analysis, but also offers a new approach to the multi-dimensional expression of geographic information, object statistics, and object measurement.
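As a concrete illustration of the camera-model step, the sketch below shows how the intrinsic and extrinsic parameters could be recovered with OpenCV, whose calibrateCamera routine implements Zhang's method and whose solvePnP supports an EPnP solver. The function names, argument layout, and control-point setup here are illustrative assumptions, not the paper's exact implementation.

```python
import cv2
import numpy as np

def calibrate_intrinsics(obj_pts, img_pts, image_size):
    """Intrinsic calibration from checkerboard views.

    OpenCV's calibrateCamera implements Zhang's method; obj_pts and
    img_pts are per-view arrays of 3D board corners and their detected
    2D image corners.
    """
    _, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, image_size,
                                           None, None)
    return K, dist

def estimate_pose_epnp(world_pts, pixel_pts, K, dist):
    """Extrinsics of the installed camera, solved with EPnP.

    world_pts / pixel_pts are >= 4 surveyed control points in world
    coordinates and their corresponding pixel positions.
    """
    _, rvec, tvec = cv2.solvePnP(world_pts, pixel_pts, K, dist,
                                 flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)       # world-to-camera rotation
    C = (-R.T @ tvec).ravel()        # camera center in world coordinates
    return R, C
```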
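Once the camera model is known, a tracked object can be localized by back-projecting a pixel (e.g., the bottom center of its bounding box) into a world-space ray and intersecting that ray with the DSM. The abstract does not specify the intersection algorithm; the sketch below uses a simple fixed-step ray march over a gridded DSM as one plausible realization, assuming a pinhole model and a DSM whose cell (0, 0) sits at world coordinate (x0, y0).

```python
import numpy as np

def pixel_to_ray(K, R, C, u, v):
    """Back-project pixel (u, v) to a ray from camera center C.

    Assumes a pinhole model x ~ K [R | -R C] X with R world-to-camera.
    """
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera-frame direction
    d_world = R.T @ d_cam                             # rotate into world frame
    return C, d_world / np.linalg.norm(d_world)

def intersect_ray_dsm(origin, direction, dsm, x0, y0, cell,
                      step=0.1, max_dist=200.0):
    """March along the ray until it drops below the DSM surface.

    dsm is a 2D elevation array with grid spacing `cell`. Returns the
    3D point just before the ray pierces the surface, or None if the
    ray leaves the DSM extent or exceeds max_dist.
    """
    t, prev = 0.0, None
    while t < max_dist:
        p = origin + t * direction
        col = int((p[0] - x0) / cell)
        row = int((p[1] - y0) / cell)
        if not (0 <= row < dsm.shape[0] and 0 <= col < dsm.shape[1]):
            return None
        if p[2] <= dsm[row, col]:        # ray has pierced the surface
            return prev if prev is not None else p
        prev = p
        t += step
    return None

# Hypothetical usage with the pose from the calibration sketch above:
# origin, direction = pixel_to_ray(K, R, C, u=642.0, v=387.0)
# ground_point = intersect_ray_dsm(origin, direction, dsm, x0, y0, cell=0.5)
```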