With rising global temperatures, wildfires frequently occur worldwide during the summer season. The timely detection of these fires, based on unmanned aerial vehicle (UAV) images, can significantly reduce the damage they cause. Existing Convolutional Neural Network (CNN)-based fire detection methods usually use multiple convolutional layers to enhance the receptive fields, but this compromises real-time performance. This paper proposes a novel real-time semantic segmentation network called FireFormer, combining the strengths of CNNs and Transformers to detect fires. An agile ResNet18 as the encoding component tailored to fulfill the efficient fire segmentation is adopted here, and a Forest Fire Transformer Block (FFTB) rooted in the Transformer architecture is proposed as the decoding mechanism. Additionally, to accurately detect and segment small fire spots, we have developed a novel Feature Refinement Network (FRN) to enhance fire segmentation accuracy. The experimental results demonstrate that our proposed FireFormer achieves state-of-the-art performance on the publicly available forest fire dataset FLAME—specifically, with an impressive 73.13% IoU and 84.48% F1 Score.