We propose an intrusion detection model based on an improved vision transformer (ViT). The model processes data with an attention mechanism, which overcomes the short-term memory limitation of recurrent neural networks (RNNs) and the difficulty convolutional neural networks have in learning long-range dependencies; it also supports parallelization and therefore computes faster than an RNN. A sliding-window mechanism is introduced to improve ViT's ability to model local features, and a hierarchical focal loss function is used to improve classification performance and mitigate data imbalance. In experimental simulations on the public NSL-KDD intrusion detection dataset, the model achieves an accuracy of 99.68%, a false-positive rate of 0.22%, and a recall of 99.57%, outperforming existing intrusion detection models on all three metrics.
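The class-imbalance idea behind the loss can be illustrated with the standard binary focal loss of Lin et al.; this is a minimal sketch for intuition, not the hierarchical variant used in the paper, and the probabilities below are made-up examples:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights
    easy, well-classified examples so that rare (e.g. attack) classes
    dominate the gradient. probs are predicted P(class=1)."""
    p_t = np.where(targets == 1, probs, 1.0 - probs)      # prob of the true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)  # class-balance weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy, confidently-correct example contributes far less loss
# than a hard example, unlike plain cross-entropy.
easy = focal_loss(np.array([0.95]), np.array([1]))[0]
hard = focal_loss(np.array([0.30]), np.array([1]))[0]
```

With gamma = 0 and alpha = 0.5 the expression reduces (up to a constant) to ordinary cross-entropy, which makes the modulating factor the only source of the re-weighting effect.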