Rolling bearings are key components of rotating machinery, and their working condition directly affects the performance and safety of the whole equipment. Deep learning based on big data is a mainstream approach to intelligent mechanical fault diagnosis, where the key lies in enhancing fault features and improving diagnostic accuracy. Unlike the Convolutional Neural Network (CNN), which relies on convolution layers to extract image features, the Vision Transformer (VIT) uses a multi-head attention mechanism to model the relationships among the pixels in an image. To improve the accuracy of rolling bearing fault diagnosis, a new fault diagnosis method based on VIT is proposed. The input vibration gray texture images are divided into patches of a predetermined size and linearly mapped into input sequences, and the global image information is integrated through the self-attention mechanism to realize fault diagnosis. To enhance expressiveness and generalization ability, a pooling layer is introduced into VIT. The test results show that the fault diagnosis accuracy of VIT on the test set reaches 94.6%, and the corresponding classification indexes are 84.2% for top-1 and 95.0% for top-5. The accuracy of the new Pooling Vision Transformer (PIT) is 3.3% higher than that of the original VIT, which shows that introducing a pooling layer can improve the image identification performance of VIT.
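The pipeline summarized above (patch splitting, linear mapping, self-attention, and pooling) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 32x32 image size, the 8x8 patch size, the embedding width of 16, the single attention head, the random weights, and the helper names `image_to_patches`, `self_attention`, and `pool_tokens` are all assumptions chosen for illustration. Plain average pooling over the token grid stands in for whatever pooling layer the method actually uses.

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an (H, W) grayscale image into flattened, non-overlapping patches."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch) -> (gh, gw, patch, patch) -> (gh*gw, patch*patch)
    x = image.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return x.reshape(gh * gw, patch * patch)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

def pool_tokens(tokens, grid, stride=2):
    """Average-pool tokens laid out row-major on a (grid, grid) layout."""
    assert grid % stride == 0, "grid must be divisible by stride"
    d = tokens.shape[-1]
    g = grid // stride
    x = tokens.reshape(g, stride, g, stride, d)
    return x.mean(axis=(1, 3)).reshape(g * g, d)

# Hypothetical sizes; a random array stands in for a vibration gray texture image.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
patch, dim = 8, 16

patches = image_to_patches(image, patch)                  # (16, 64)
w_embed = rng.normal(scale=0.02, size=(patch * patch, dim))
tokens = patches @ w_embed                                # linear mapping, (16, 16)

w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
attended, weights = self_attention(tokens, w_q, w_k, w_v)  # (16, 16) each

pooled = pool_tokens(attended, grid=32 // patch)          # (4, 16)
print(patches.shape, attended.shape, pooled.shape)
```

In this sketch every patch token attends to every other token, which is how global image information is integrated, and the pooling step reduces the 4x4 token grid to 2x2 before any later stage. A full model would add position embeddings, multiple heads, feed-forward layers, and a classification head.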