The normal operation of rolling bearings is crucial to the performance and reliability of rotating machinery. However, collected vibration signals are often contaminated by complex noise, and a transformer network alone cannot fully extract the fault features of these signals. To address this problem, we propose a data preprocessing method based on singular value decomposition (SVD) and the continuous wavelet transform (CWT), combined with an improved vision transformer (ViT) model for fault diagnosis. First, SVD is applied to identify and suppress the noise components, improving data quality. Then, the CWT converts the denoised signal into a two-dimensional (2D) time–frequency representation (TFR) that displays the fault features more intuitively. Finally, an improved multi-scale convolutional block attention module (MSCBAM) is embedded into the ViT network to extract fault features. Experimental results on the classical Case Western Reserve University (CWRU) dataset show that the proposed method achieves an average diagnostic accuracy of 99.3%. In comparisons with six other fault diagnosis methods, the proposed method also achieves good diagnostic results on three other datasets. It can therefore support the timely handling of faulty equipment and help reduce downtime.
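The two preprocessing steps can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the Hankel embedding, the window length, the retained rank, the complex Morlet wavelet with `w0 = 6`, and the function names `svd_denoise` and `morlet_cwt` are all assumptions chosen for illustration, and the input is a synthetic tone rather than bearing data.

```python
import numpy as np


def svd_denoise(signal, window=64, rank=2):
    """Keep the `rank` largest singular components of a Hankel embedding.

    The Hankel embedding, `window`, and `rank` are illustrative assumptions.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    cols = n - window + 1
    # idx[i, j] = i + j, so hankel[i, j] = x[i + j].
    idx = np.arange(window)[:, None] + np.arange(cols)[None, :]
    hankel = x[idx]
    u, s, vt = np.linalg.svd(hankel, full_matrices=False)
    s[rank:] = 0.0  # treat the small singular values as noise
    low_rank = (u * s) @ vt
    # Average every anti-diagonal to map the matrix back to a 1D signal.
    total = np.zeros(n)
    count = np.zeros(n)
    np.add.at(total, idx, low_rank)
    np.add.at(count, idx, 1.0)
    return total / count


def morlet_cwt(signal, fs, freqs, w0=6.0):
    """Magnitude scalogram from an FFT-based complex Morlet CWT (unnormalized).

    The wavelet choice and `w0` are assumptions, not taken from the paper.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    sig_fft = np.fft.fft(x)
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)  # rad/s
    tfr = np.empty((len(freqs), n))
    for i, f in enumerate(freqs):
        scale = w0 / (2.0 * np.pi * f)  # scale whose centre frequency is f
        # Frequency-domain Morlet, restricted to positive frequencies.
        psi_hat = np.exp(-0.5 * (scale * omega - w0) ** 2) * (omega > 0)
        tfr[i] = np.abs(np.fft.ifft(sig_fft * psi_hat))
    return tfr


if __name__ == "__main__":
    fs = 2000.0  # hypothetical sampling rate in Hz
    t = np.arange(1024) / fs
    rng = np.random.default_rng(0)
    clean = np.sin(2.0 * np.pi * 200.0 * t)
    noisy = clean + 0.5 * rng.standard_normal(t.size)
    denoised = svd_denoise(noisy)
    tfr = morlet_cwt(denoised, fs, np.arange(50.0, 501.0, 50.0))
    print(tfr.shape)  # rows: analysis frequencies, columns: time samples
```

In a pipeline of this kind, the resulting 2D array would typically be resized and saved as an image before being passed to the classification network. That step, and the MSCBAM-equipped ViT itself, are outside the scope of this sketch.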