The use of closed-circuit television (CCTV) for safety monitoring is crucial for reducing accidents on construction sites. However, most currently proposed approaches rely on a single detection model and do not consider the context of the CCTV video input. In this study, a deep-learning-based multimodal detection and depth map estimation algorithm is proposed. The point cloud of the test site is acquired using a terrestrial laser scanner, and the coordinates of detected objects are projected into global coordinates using a homography matrix. Visualizing the entire monitored scene in this way enhances the effectiveness of the proposed monitoring system. To validate the proposed method, a synthetic dataset of construction site accident scenarios is simulated with Twinmotion. These scenarios are then evaluated with the proposed method to determine its precision and inference speed. Lastly, an actual construction site equipped with multiple CCTV cameras is used for system deployment and visualization. The results show that the proposed method is robust in detecting potential hazards on a construction site and achieves real-time detection speed.
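The projection of detected objects from image coordinates to site coordinates with a homography matrix can be illustrated with a minimal sketch. It is not the implementation used in this study: the function name `project_to_site`, the matrix `H_example`, and the sample pixel coordinates are hypothetical values chosen for illustration, and the sketch assumes that each detection is represented by a single image point lying on the ground plane.

```python
import numpy as np


def project_to_site(H, pixels):
    """Map N x 2 pixel coordinates to N x 2 plane coordinates using a 3 x 3 homography H."""
    pts = np.asarray(pixels, dtype=float)
    # Append a 1 to each point to obtain homogeneous coordinates (N x 3).
    homogeneous = np.hstack([pts, np.ones((pts.shape[0], 1))])
    # Apply the homography to every point at once.
    mapped = homogeneous @ np.asarray(H, dtype=float).T
    # Divide by the third component to return to Cartesian coordinates.
    return mapped[:, :2] / mapped[:, 2:3]


# Hypothetical homography (assumption): scales pixels to metres and shifts the origin.
H_example = np.array([
    [0.05, 0.0, -10.0],
    [0.0, 0.05, -5.0],
    [0.0, 0.0, 1.0],
])

# Hypothetical image points for two detected objects (assumption).
image_points = [(400.0, 300.0), (200.0, 100.0)]
site_xy = project_to_site(H_example, image_points)
print(site_xy)
```

In practice the homography would be estimated from point correspondences between the CCTV image and the surveyed site plane rather than written by hand, and the division by the third homogeneous component matters whenever the matrix includes perspective terms.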