X‐ray security checks aim to detect contraband in luggage; however, detection accuracy is hindered by overlapping objects and by large differences in object size in X‐ray images. To address these challenges, the authors introduce a novel network model named Multi‐Scale Feature Attention (MSFA)‐DEtection TRansformer (DETR). First, a pyramid feature extraction structure is embedded into the self‐attention module; the resulting module is referred to as MSFA. Using the MSFA module, MSFA‐DETR extracts multi‐scale feature information and fuses it into high‐level semantic features. These features are then combined through attention mechanisms to capture correlations between global information and multi‐scale features. MSFA significantly improves the model's robustness to objects of different sizes, thereby enhancing detection accuracy. In addition, a new initialisation method for object queries is proposed. The authors’ foreground sequence extraction (FSE) module extracts key feature sequences from feature maps, which serve as prior knowledge for object queries. FSE speeds up the convergence of the DETR model and improves detection accuracy. Extensive experiments show that the proposed model surpasses state‐of‐the‐art methods on the CLCXray and PIDray datasets.
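
The idea of using key feature sequences as priors for object queries can be illustrated with a small sketch. The abstract does not specify how FSE scores or selects features, so the following is only an assumption-laden illustration: the function name `select_foreground_tokens`, the use of the L2 norm as a foreground score, and the top-k selection rule are all hypothetical stand-ins, not the authors' method.

```python
import math


def select_foreground_tokens(tokens, k):
    """Return indices and vectors of the k highest-scoring tokens.

    `tokens` is a flattened feature map: one feature vector per spatial
    position. The L2 norm used as the score is a hypothetical stand-in
    for a learned foreground score; the abstract does not define one.
    """
    if k <= 0 or k > len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    scores = [math.sqrt(sum(v * v for v in tok)) for tok in tokens]
    # Sort positions by descending score; the sort is stable, so ties
    # keep their original spatial order.
    order = sorted(range(len(tokens)), key=lambda i: -scores[i])
    top = order[:k]
    return top, [tokens[i] for i in top]


# Toy 2x2 feature map flattened to 4 tokens with 2 channels each.
feature_tokens = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [1.0, 1.0]]
indices, queries = select_foreground_tokens(feature_tokens, k=2)
```

In a DETR-style decoder, the selected vectors would seed the object queries in place of an initialisation that carries no image-specific information, which is one plausible reading of why such a prior could speed up convergence.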