As one of the important features in the early stage of fires, the detection of smoke can provide a faster early warning of a fire, thus suppressing the spread of the fire in time. However, the features of smoke are not apparent; the shape of smoke is not fixed, and it is easy to be confused with the background outdoors, which leads to difficulties in detecting smoke. Therefore, this study proposes a model called Smoke Detection Transformer (Smoke-DETR) for smoke detection, which is based on a Real-Time Detection Transformer (RT-DETR). Considering the limited computational resources of smoke detection devices, Enhanced Channel-wise Partial Convolution (ECPConv) is introduced to reduce the number of parameters and the amount of computation. This approach improves Partial Convolution (PConv) by using a selection strategy that selects channels containing more information for each convolution, thereby increasing the network’s ability to learn smoke features. To cope with smoke images with inconspicuous features and irregular shapes, the Efficient Multi-Scale Attention (EMA) module is used to strengthen the feature extraction capability of the backbone network. Additionally, in order to overcome the problem of smoke being easily confused with the background, the Multi-Scale Foreground-Focus Fusion Pyramid Network (MFFPN) is designed to strengthen the model’s attention to the foreground of images, which improves the accuracy of detection in situations where smoke is not well differentiated from the background. Experimental results demonstrate that Smoke-DETR has achieved significant improvements in smoke detection. In the self-building dataset, compared to RT-DETR, Smoke-DETR achieves a Precision that has reached 86.2%, marking an increase of 3.6 percentage points. Similarly, Recall has achieved 80 %, showing an improvement of 3.6 percentage points. In terms of mAP50, it has reached 86.2%, with a 3.8 percentage point increase. Furthermore, mAP50 has reached 53.9%, representing a 3.6 percentage point increase.