Forest fires are highly unpredictable and extremely destructive, which makes effective prevention and control difficult. Once a fire spreads, it causes devastating damage to natural resources and the ecological environment. To detect forest fires early in real time and assist firefighting, we propose a vision-based detection and spatial localization scheme and develop a system carried on an unmanned aerial vehicle (UAV) equipped with an OAK-D camera. During periods of high fire incidence, UAVs equipped with our system are deployed to patrol the forest. Our scheme includes two key aspects. First, the lightweight model NanoDet is applied as a detector to identify and locate fires in the field of view. Techniques such as a cosine learning rate schedule and data augmentation are employed to further improve mean average precision (mAP). Once the detector flags a 2D frame containing fire, binocular stereo vision is applied to compute a depth map, and an HSV-mask filter together with a non-zero mean method is proposed to eliminate interfering values when estimating the depth of the fire area. Second, to obtain the latitude, longitude, and altitude (LLA) coordinates of the fire area, coordinate frame conversions are applied to data from the GPS module and the inertial measurement unit (IMU). Finally, we test the effectiveness of the system with simulated fires in a forest area. The results show that 89.34% of the suspicious frames with flame targets are detected and that the localization error in latitude and longitude is on the order of 10⁻⁵ degrees; this demonstrates that the system meets our precision requirements and is sufficient for forest fire inspection.
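
To make the depth-estimation step concrete, the sketch below shows how an HSV-mask filter combined with a non-zero mean could be used to estimate the depth of a detected fire region. It is a minimal illustration assuming OpenCV and NumPy; the function name, HSV thresholds, and depth units are assumptions and would need to be tuned to the actual camera pipeline and scene.

```python
import cv2
import numpy as np

def fire_region_depth(bgr_roi, depth_roi, hsv_lo=(0, 120, 150), hsv_hi=(35, 255, 255)):
    """Estimate the depth of a detected fire region (hypothetical helper).

    bgr_roi   -- colour crop of the detected fire bounding box (H x W x 3, uint8)
    depth_roi -- aligned stereo depth crop (H x W, millimetres, 0 = invalid pixel)
    hsv_lo/hi -- assumed HSV thresholds for flame colours; tune for the scene
    """
    # HSV-mask filter: keep only flame-coloured pixels inside the bounding box
    hsv = cv2.cvtColor(bgr_roi, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(hsv_lo, np.uint8), np.array(hsv_hi, np.uint8))

    # Non-zero mean: average only valid (non-zero) depths under the mask so that
    # disparity holes and background pixels do not distort the estimate
    valid = depth_roi[(mask > 0) & (depth_roi > 0)]
    return float(valid.mean()) if valid.size else None
```

Averaging only masked, non-zero depth pixels is what keeps stereo-matching holes and non-flame background from biasing the distance estimate toward zero.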
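
Likewise, the coordinate-frame conversion from a camera-frame fire position to LLA coordinates could look roughly as follows. This is a sketch assuming SciPy and the pymap3d package, an identity camera-to-body mounting rotation, and a roll-pitch-yaw Euler convention; the actual rotation conventions depend on the IMU and airframe setup.

```python
import numpy as np
import pymap3d
from scipy.spatial.transform import Rotation

def fire_lla(p_cam, rpy_deg, uav_lla, R_body_cam=np.eye(3)):
    """Convert a fire position from the camera frame to LLA (hypothetical helper).

    p_cam      -- (x, y, z) of the fire in the camera frame, metres (from the depth map)
    rpy_deg    -- UAV attitude (roll, pitch, yaw) in degrees, as reported by the IMU
    uav_lla    -- (lat, lon, alt) of the UAV from the GPS module
    R_body_cam -- camera-to-body rotation; identity here, replace with the mounting matrix
    """
    # Camera frame -> UAV body frame
    p_body = R_body_cam @ np.asarray(p_cam, dtype=float)

    # Body frame -> local East-North-Up frame using the IMU attitude
    R_enu_body = Rotation.from_euler("xyz", rpy_deg, degrees=True).as_matrix()
    east, north, up = R_enu_body @ p_body

    # ENU offset -> geodetic coordinates, anchored at the UAV's own GPS fix
    lat0, lon0, alt0 = uav_lla
    return pymap3d.enu2geodetic(east, north, up, lat0, lon0, alt0)
```

In this arrangement the UAV's GPS fix serves as the local origin, so the fire's LLA coordinates are obtained by rotating the camera-frame measurement into the local ENU frame and offsetting the geodetic origin accordingly.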