Efficient and accurate fault diagnosis of rotating machinery is extremely important. Fault diagnosis methods that apply convolutional neural networks (CNNs) to vibration signals have become increasingly mature, but they often struggle to capture the temporal dynamics of those signals. To overcome this limitation, Vision Transformer (ViT) methods are gaining traction in fault diagnosis. However, these methods typically require extensive preprocessing, which increases computational complexity and can reduce the efficiency of the diagnosis process. To address this gap, this paper presents the Time Series Vision Transformer (TSViT), a model tailored for effective fault diagnosis. The TSViT combines a convolutional layer, which extracts local features from vibration signals, with a transformer encoder, which captures long-term temporal patterns. A thorough experimental comparison on three diverse datasets demonstrates the TSViT’s effectiveness and adaptability. The paper also examines how hyperparameter choices affect the model’s performance, computational demand, and parameter count. The TSViT achieves 100% average accuracy on two of the test sets and 99.99% on the third, demonstrating strong fault diagnosis capability for rotating machinery. Deploying this model in practice could bring significant economic benefits.
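The two-stage idea described above, convolutional feature extraction followed by attention over the whole sequence, can be illustrated with a minimal pure-Python sketch. It is not the TSViT implementation: the function names, the synthetic sine signal, the kernel width and stride of 16, and the 8 untrained random kernels are all assumptions made for illustration, and the attention step leaves out the learned projections, multiple heads, positional encodings, feed-forward layers, residual connections, and normalization that a full transformer encoder contains.

```python
import math
import random


def conv1d_tokens(signal, kernels, stride):
    """Strided 1D convolution: each window of the signal becomes one token
    whose entries are the responses of the individual kernels."""
    width = len(kernels[0])
    tokens = []
    for start in range(0, len(signal) - width + 1, stride):
        window = signal[start:start + width]
        tokens.append([sum(w * x for w, x in zip(k, window)) for k in kernels])
    return tokens


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections
    (queries = keys = values = tokens). Every output token is a weighted mix
    of all tokens, regardless of how far apart they are in time."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        mixed.append(
            [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        )
    return mixed


random.seed(0)
# Synthetic stand-in for a vibration signal (assumption, not real data).
signal = [math.sin(2 * math.pi * 5 * t / 256) for t in range(256)]
# Hypothetical, untrained kernels: 8 filters of width 16.
kernels = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(8)]

tokens = conv1d_tokens(signal, kernels, stride=16)   # local features
encoded = self_attention(tokens)                     # long-range mixing
# Mean-pool over time to get one feature vector a classifier head could use.
pooled = [sum(col) / len(encoded) for col in zip(*encoded)]
```

With these assumed sizes, the 256-sample signal is split into 16 non-overlapping windows, so the attention step operates on 16 tokens of 8 features each rather than on the raw samples. Working directly on the one-dimensional signal in this way is what lets such a design avoid a separate signal-to-image preprocessing stage.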