The frequent occurrence of forest fires causes irreparable damage to the environment and the economy, making accurate forest fire detection particularly important. Because flames vary widely in shape, texture, and scale, traditional forest fire detection methods suffer from high false alarm rates and poor adaptability, which severely limits their use. To address the low detection accuracy caused by the multi-scale characteristics and changeable morphology of forest fires, this paper proposes YOLOv5s-CCAB, an improved multi-scale forest fire detection model based on YOLOv5s. First, coordinate attention (CA) was added to YOLOv5s so that the network focuses more on forest fire features. Second, the Contextual Transformer (CoT) was introduced into the backbone network and a CoT3 module was built, reducing the number of parameters while improving forest fire detection and the ability to capture global dependencies in forest fire images. Then, the Complete Intersection over Union (CIoU) loss function was modified to improve the network's detection accuracy for forest fire targets. Finally, a Bi-directional Feature Pyramid Network (BiFPN) was constructed at the neck to fuse the extracted forest fire features more effectively. Experimental results on the constructed multi-scale forest fire dataset show that YOLOv5s-CCAB increases AP@0.5 by 6.2% to 87.7% and reaches 36.6 FPS, indicating that YOLOv5s-CCAB achieves both high detection accuracy and high speed. The method can serve as a reference for the real-time, accurate detection of multi-scale forest fires.
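The abstract states that the CIoU loss was modified, but does not specify how; as background, the standard CIoU loss that serves as the starting point can be sketched as follows. This is an illustrative, framework-free reimplementation of the well-known CIoU formulation (IoU term, normalized center-distance penalty, and aspect-ratio consistency term), not the authors' modified variant.

```python
import math

def ciou_loss(box1, box2, eps=1e-7):
    """Standard CIoU loss between two axis-aligned boxes (x1, y1, x2, y2).

    CIoU = IoU - rho^2/c^2 - alpha*v; the loss is 1 - CIoU.
    This sketches the baseline formulation the paper modifies.
    """
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2

    # Intersection and union areas
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    w1, h1 = x2 - x1, y2 - y1
    w2, h2 = X2 - X1, Y2 - Y1
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # rho^2: squared distance between the two box centers
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0

    # c^2: squared diagonal of the smallest enclosing box
    cw = max(x2, X2) - min(x1, X1)
    ch = max(y2, Y2) - min(y1, Y1)
    c2 = cw ** 2 + ch ** 2 + eps

    # v: aspect-ratio consistency term; alpha: its trade-off weight
    v = (4.0 / math.pi ** 2) * (
        math.atan(w2 / (h2 + eps)) - math.atan(w1 / (h1 + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - (iou - rho2 / c2 - alpha * v)
```

For identical boxes the loss is near zero; for disjoint boxes the center-distance penalty pushes it above 1, which is what gives CIoU a useful gradient even when boxes do not overlap.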