Real-time traffic monitoring has become an essential part of the intelligent city. In recent years, the adoption of surveillance cameras has grown rapidly because they help manage and control traffic. However, installing cameras on every road in a city is impractical because of the high costs of deployment and maintenance. Given the information from a limited number of surveillance cameras, can we accurately infer citywide traffic volume? This question is challenging because no historical data exist for roads without cameras, so a method must go beyond inference from nearby traffic data. Moreover, a useful property of surveillance camera data is that AI-equipped cameras can recognize individual vehicles, so we can recover incomplete trajectories of vehicles from the plate numbers in the camera records. For road segments without cameras, however, we do not know whether those vehicles passed through them. How can such incomplete trajectories be used effectively to help citywide traffic inference? In this paper, we propose a framework named CityVolInf to infer citywide traffic volume from surveillance camera records. Our framework combines a semi-supervised learning-based similarity module with a novel simulation module to address these challenges. The similarity module focuses on the spatiotemporal correlations of traffic volume between road segments, while the simulation module uses incomplete trajectories to model the transitions of traffic volume between adjacent road segments. The framework thus bridges the conventional data-driven approach and transportation domain knowledge from the simulator. We conduct extensive experiments on a real-world dataset containing 405,370,631 camera records collected from 1,704 surveillance cameras over a period of 31 days in a provincial capital in China. The experimental results demonstrate the effectiveness of CityVolInf compared with existing methods.
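To make the notion of an incomplete trajectory concrete, the following minimal Python sketch groups camera records by plate number and orders each vehicle's sightings by time. It is an illustration only, not the CityVolInf implementation: the record layout `(plate, timestamp, segment_id)`, the function names `recover_trajectories` and `count_transitions`, and the toy data are all assumptions made for this example. Consecutive sightings need not lie on adjacent road segments, because the segments a vehicle crossed between two cameras are unobserved.

```python
from collections import defaultdict


def recover_trajectories(records):
    """Group (plate, timestamp, segment_id) records into per-vehicle,
    time-ordered lists of camera-equipped segments (hypothetical layout)."""
    by_plate = defaultdict(list)
    for plate, timestamp, segment in records:
        by_plate[plate].append((timestamp, segment))

    trajectories = {}
    for plate, sightings in by_plate.items():
        sightings.sort()  # order each vehicle's sightings by timestamp
        trajectories[plate] = [segment for _, segment in sightings]
    return trajectories


def count_transitions(trajectories):
    """Count how often vehicles are seen moving from one observed segment
    to the next observed one. Unobserved segments in between are unknown."""
    counts = defaultdict(int)
    for segments in trajectories.values():
        for src, dst in zip(segments, segments[1:]):
            if src != dst:
                counts[(src, dst)] += 1
    return dict(counts)


# Toy records, invented for illustration: (plate, timestamp, segment_id).
records = [
    ("A123", 10, "s1"),
    ("B456", 12, "s1"),
    ("A123", 45, "s4"),
    ("B456", 30, "s2"),
    ("A123", 80, "s7"),
]

trajectories = recover_trajectories(records)
transitions = count_transitions(trajectories)
print(trajectories)
print(transitions)
```

Observed transition counts of this kind are one plausible input for modeling how traffic volume moves between road segments; how the simulation module actually consumes the trajectories is not specified in this abstract.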