Network intrusion detection technology has long been an indispensable protection mechanism for industrial network security, and the rise of new forms of network attacks has heightened the demand for it. Nevertheless, the effectiveness of current detection models remains limited. We propose a new Deformable Vision Transformer (DE-VIT) method to address this issue. DE-VIT introduces a deformable attention module in which the positions of key-value pairs are selected in a data-dependent manner, allowing the model to focus on relevant regions, capture more informative features, and avoid excessive memory and computational cost. In the embedding layers, deformable convolutions replace regular convolutions to enlarge the receptive field of patches, and a sliding-window mechanism is employed to fully exploit edge information. In parallel, a layered focal loss function is used to improve classification performance and mitigate data imbalance. Overall, DE-VIT reduces computational complexity while achieving better detection results. Experiments on public intrusion detection datasets show that the proposed model surpasses the Deep Belief Network with Improved Kernel-Based Extreme Learning Machine (DBN-KELM) in accuracy, reaching 99.5% on CIC-IDS2017 and 97.5% on UNSW-NB15, improvements of 8.5% and 9.1%, respectively.

INDEX TERMS Network intrusion detection, deformable vision transformer, deformable convolution, deformable attention mechanism, vision transformer.
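To make the loss-function choice concrete, the following is a minimal PyTorch sketch of a standard focal loss, the base form that a layered (hierarchical) variant such as the one described above would build on; the layered weighting itself is not specified in the abstract, and the `alpha`/`gamma` values and the `focal_loss` helper are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of a standard focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).
# This is NOT the paper's exact "layered" focal loss; it only illustrates the base form.
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Multi-class focal loss over raw logits (N, C) and integer class targets (N,)."""
    log_probs = F.log_softmax(logits, dim=-1)                       # log-probability of every class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t of the true class
    pt = log_pt.exp()
    # The (1 - p_t)^gamma factor down-weights well-classified samples, so rare
    # attack classes contribute more to the gradient and class imbalance is eased.
    loss = -alpha * (1.0 - pt) ** gamma * log_pt
    return loss.mean()


# Example usage with random data (5 hypothetical traffic classes, batch of 8 flows).
if __name__ == "__main__":
    logits = torch.randn(8, 5)
    labels = torch.randint(0, 5, (8,))
    print(focal_loss(logits, labels))
```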