Mine video surveillance plays a key role in ensuring production safety in intelligent mining. However, existing intelligent mine monitoring technology mainly processes video data in the cloud, which leads to problems such as network congestion, large memory consumption, and delayed response to regional emergencies. In this paper, we address these limitations with an edge-cloud collaborative optimization framework. First, we obtained a coarse model using the edge-cloud collaborative architecture and updated it to continuously improve the detection model. Second, we proposed a target detection model based on the Vision Swin Transformer-YOLOv5 (ViST-YOLOv5) algorithm and improved the model for edge device deployment. The experimental results showed that the ViST-YOLOv5-based object detection model, with a model size of only 27.057 MB, improved the average detection accuracy by 25% compared with the state-of-the-art model, which makes it suitable for edge-end deployment at the mining workface. On actual mine surveillance video, the edge-cloud collaborative architecture achieves better performance and robustness in typical application scenarios, such as weak lighting and occlusion, which verifies the feasibility of the designed architecture.