Fire detection models based on computer vision suffer from long training and inference times, large parameter counts, and low detection accuracy. To address these problems, we propose ES-YOLO, which detects flames and smoke quickly and accurately. First, the original YOLOv5s backbone network is replaced with EfficientNetV2, which reduces the computational complexity of the network and improves detection accuracy. Second, the CIoU loss function is replaced with SIoU, which speeds up model convergence. Finally, 9-Mosaic data augmentation is designed to enrich the dataset. Experimental results on the PASCAL VOC2007 dataset demonstrate that the mAP@0.5 and recall of ES-YOLO are 20% and 15% higher, respectively, than those of YOLOv5s, while the model size is compressed to half that of YOLOv5s. ES-YOLO thus meets the requirements of lightweight, real-time detection.
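
To make the 9-Mosaic idea concrete, the following is a minimal sketch of the label-remapping step only, assuming that nine source images are each resized to a square tile and placed row by row on a 3x3 canvas. The function name `mosaic9_boxes`, the `cell` parameter, and the fixed-grid layout are illustrative assumptions, not the layout used in ES-YOLO, which may place or crop tiles differently.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels of the source image
Box = Tuple[float, float, float, float]


def mosaic9_boxes(boxes_per_image: List[List[Box]],
                  src_sizes: List[Tuple[int, int]],
                  cell: int = 213) -> List[Box]:
    """Map boxes from nine source images onto a 3x3 mosaic canvas.

    Assumption: each source image is resized to a cell x cell tile and
    tiles are placed row by row, so image i lands at row i // 3, col i % 3.
    """
    if len(boxes_per_image) != 9 or len(src_sizes) != 9:
        raise ValueError("exactly nine images are required")

    merged: List[Box] = []
    for idx, (boxes, (w, h)) in enumerate(zip(boxes_per_image, src_sizes)):
        row, col = divmod(idx, 3)
        sx, sy = cell / w, cell / h          # per-axis resize factors
        ox, oy = col * cell, row * cell      # top-left corner of this tile
        for x0, y0, x1, y1 in boxes:
            # Scale the box into the tile, then shift it to the tile position.
            merged.append((x0 * sx + ox, y0 * sy + oy,
                           x1 * sx + ox, y1 * sy + oy))
    return merged
```

In this sketch, a box in the fifth image (index 4) ends up in the center tile of the canvas, so one training sample carries objects from nine scenes at reduced scale. The pixel-level tiling of the images themselves is omitted here and would follow the same offsets.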