Video surveillance equipment, which is widely deployed in most cities, can record urban flooding as it unfolds. Ubiquitous reference objects that frequently appear in these videos can serve as indicators of waterlogging depth, which makes video images a valuable data source for measuring it. However, the waterlogging information contained in video images has not yet been effectively mined or used. In this paper, we present a method that automatically estimates urban waterlogging depths from video images using ubiquitous reference objects. First, reference objects are detected in video images from both flooding and non-flooding periods using a convolutional neural network (CNN) object detection model. Then, waterlogging depths are estimated from the height differences between the reference objects detected in the two periods. A case study is used to evaluate the proposed method. The results show that the method can effectively extract waterlogging depth information from video images. It offers low economic cost, acceptable accuracy, high spatiotemporal resolution, and wide coverage, which makes it feasible to deploy within cities for urban flood monitoring.
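The height-difference step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that the same reference object is seen from a fixed camera in both periods, that the detector returns axis-aligned bounding boxes in pixel coordinates, that the water hides only the lower part of the object, and that the object's real-world height is known. The names `Box`, `estimate_depth`, and `real_height_m` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detection box in pixel coordinates (y grows downward)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def height(self) -> float:
        # Visible pixel height of the detected object.
        return self.y_max - self.y_min


def estimate_depth(dry_box: Box, flood_box: Box, real_height_m: float) -> float:
    """Estimate water depth in metres from two detections of one object.

    Assumption: the pixel-to-metre scale is taken from the non-flooding
    detection, where the whole object of known height is visible.
    """
    if dry_box.height <= 0 or real_height_m <= 0:
        raise ValueError("dry-period box height and real height must be positive")

    metres_per_pixel = real_height_m / dry_box.height
    # Pixels of the object hidden by water during the flooding period.
    submerged_px = dry_box.height - flood_box.height
    depth_m = submerged_px * metres_per_pixel
    # Clamp to the physically meaningful range for this object.
    return min(max(depth_m, 0.0), real_height_m)


# Illustrative values only: an object 0.70 m tall spans 100 px when dry
# and 75 px when partly submerged.
dry = Box(100, 200, 140, 300)
wet = Box(100, 200, 140, 275)
depth = estimate_depth(dry, wet, real_height_m=0.70)
print(f"estimated depth: {depth:.3f} m")
```

Taking the scale from the non-flooding detection avoids an explicit camera calibration, at the cost of assuming that the object and camera do not move between the two periods. A depth greater than the object's height cannot be recovered from that object alone, which is why the result is clamped.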