With the widespread adoption of WiFi technology, precise localization in multi-floor indoor environments plays an increasingly important role in public places. However, Received Signal Strength Indicator (RSSI) measurements in indoor scenarios still suffer from noise and multi-path effects caused by propagation attenuation and shadowing by obstacles, which reduce indoor localization accuracy. In this paper, we propose a Vision Transformer indoor localization (VTIL) algorithm based on RSSI and the Vision Transformer (ViT). First, the RSSI fingerprints are min-max normalized, and Principal Component Analysis (PCA) is used for feature extraction to reduce the influence of noise and unimportant features. Second, the RSSI fingerprint library is converted into a library of RSSI grayscale images. Then, each RSSI grayscale image is divided into several small patches, which are fed to the model as a sequence with positional encoding. The ViT model weights each patch of the RSSI image differently to alleviate the impact of multi-path effects. Experimental results on public datasets show that our method reduces the average distance estimation error by 37.26% compared with currently known indoor fingerprint localization methods.
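The preprocessing steps described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the function names, the synthetic fingerprints, the use of a single global minimum and maximum, and the sizes (16 principal components, 4 x 4 images, 2 x 2 patches) are all assumptions chosen for illustration. The ViT model itself is omitted.

```python
import numpy as np

def min_max_normalize(rssi):
    """Scale all values into [0, 1] using the global min and max (assumed scheme)."""
    lo, hi = rssi.min(), rssi.max()
    if hi <= lo:
        return np.zeros_like(rssi, dtype=float)
    return (rssi - lo) / (hi - lo)

def pca_reduce(x, n_components):
    """Project the rows of x onto the top principal components, computed via SVD."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def to_gray_images(features, side):
    """Rescale features to 0..255 and reshape each row into a side x side image."""
    scaled = np.rint(min_max_normalize(features) * 255).astype(np.uint8)
    return scaled.reshape(-1, side, side)

def to_patch_sequence(image, patch):
    """Split a square image into flattened, non-overlapping patch x patch blocks."""
    n = image.shape[0] // patch
    blocks = image.reshape(n, patch, n, patch).swapaxes(1, 2)
    return blocks.reshape(n * n, patch * patch)

# Synthetic stand-in for a fingerprint library: 50 samples, 40 access points (dBm).
rng = np.random.default_rng(0)
fingerprints = rng.uniform(-100.0, -30.0, size=(50, 40))

normalized = min_max_normalize(fingerprints)
features = pca_reduce(normalized, n_components=16)
images = to_gray_images(features, side=4)          # one 4 x 4 image per fingerprint
patches = to_patch_sequence(images[0], patch=2)    # sequence of 4 flattened patches
positions = np.arange(len(patches))                # indices for positional encoding
```

In this sketch each fingerprint becomes a sequence of four patch vectors with position indices 0 to 3. A ViT encoder would embed these patches, add the positional encoding, and learn how strongly each patch contributes to the location estimate.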