In rolling bearing fault diagnosis, the collected vibration signals contain complex noise interference, and one-dimensional information alone is insufficient to fully mine the fault features in the data. This paper proposes a rolling bearing fault diagnosis method that combines SVD-GST with the Vision Transformer (ViT). First, the one-dimensional vibration signal is denoised using singular value decomposition (SVD) to obtain a cleaner, more informative signal. Then, the generalized S-transform (GST) converts the denoised signal into a two-dimensional time–frequency image, so that the strengths of deep learning in image classification can be exploited to improve recognition accuracy. To avoid the limited receptive fields of convolutional neural networks (CNNs) and the step-by-step computation over time that recurrent neural networks (RNNs) require when processing sequence data, a Vision Transformer model is used for pattern recognition and classification. Finally, an experimental platform for rolling bearing fault diagnosis is built and the model is validated experimentally, achieving an average accuracy of 98.52% over multiple tests. Compared with the SVD-GST-2DCNN, STFT-CNN-LSTM, SVD-GST-LSTM, and GST-ViT fault diagnosis models, the proposed method achieves higher diagnostic accuracy and stability, providing a new approach to rolling bearing fault diagnosis.
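As a rough illustration of the first two stages of the pipeline (SVD denoising followed by a generalized S-transform), the sketch below may help. It is not the authors' implementation. Several details are assumptions that the abstract does not specify: the Hankel (trajectory) matrix construction used for the SVD step, the number of retained singular values `rank`, and the GST window parametrization, in which the Gaussian width in the frequency domain scales as `lam * k**p` for frequency bin `k` (with `lam = 1`, `p = 1` this reduces to the standard S-transform). The function names `svd_denoise` and `generalized_s_transform` are hypothetical.

```python
import numpy as np


def svd_denoise(signal, window=None, rank=2):
    """Denoise a 1-D signal by truncating the SVD of its Hankel matrix.

    Assumption: a Hankel embedding with `window` rows, keeping the `rank`
    largest singular values. The paper's exact scheme may differ.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    rows = window or n // 2
    cols = n - rows + 1
    # Hankel (trajectory) matrix: H[i, j] = x[i + j]
    idx = np.arange(rows)[:, None] + np.arange(cols)[None, :]
    H = x[idx]
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    s[rank:] = 0.0                      # discard the small singular values
    H_low = (U * s) @ Vt                # low-rank reconstruction
    # Average along anti-diagonals to map the matrix back to a 1-D signal
    sums = np.bincount(idx.ravel(), weights=H_low.ravel(), minlength=n)
    counts = np.bincount(idx.ravel(), minlength=n)
    return sums / counts


def generalized_s_transform(signal, lam=1.0, p=1.0):
    """FFT-based generalized S-transform (frequencies in bin units).

    Returns a complex array of shape (n // 2 + 1, n): rows are frequency
    bins 0..n/2, columns are time samples. `lam` and `p` are assumed
    window-shaping parameters, not values taken from the paper.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    X = np.fft.fft(x)
    m = np.fft.fftfreq(n, d=1.0 / n)    # integer offsets 0, 1, ..., -1
    S = np.zeros((n // 2 + 1, n), dtype=complex)
    S[0, :] = x.mean()                  # zero-frequency row is the mean
    for k in range(1, n // 2 + 1):
        width = lam * k ** p            # Gaussian width in frequency bins
        gauss = np.exp(-2.0 * np.pi ** 2 * m ** 2 / width ** 2)
        # np.roll(X, -k)[m] == X[m + k], i.e. the spectrum shifted by k bins
        S[k, :] = np.fft.ifft(np.roll(X, -k) * gauss)
    return S


# Example: noisy tone -> denoised signal -> time-frequency magnitude image
rng = np.random.default_rng(0)
t = np.arange(256)
clean = np.sin(2 * np.pi * 20 * t / 256)
noisy = clean + 0.5 * rng.standard_normal(t.size)
denoised = svd_denoise(noisy, rank=2)
tf_image = np.abs(generalized_s_transform(denoised))
```

The magnitude array `tf_image` is the kind of two-dimensional time–frequency representation that could then be resized and passed to an image classifier such as a ViT. In practice the retained rank would be chosen from the singular value spectrum rather than fixed in advance.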