Rice is a global staple crop that feeds approximately half of the world's population, yet the persistent spread of disease poses a significant threat to its production, so accurate identification of rice diseases is of great practical importance. This work proposes a hybrid image-classification architecture that combines the strengths of Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs). The study covers five primary diseases of rice crops: Blast, Brown Spot, Tungro, False Smut, and Bacterial Sheath Blight. Approximately 8000 images of these rice leaf diseases were used for training. The method differs from the traditional ViT architecture by integrating a CNN block within the transformer layers. ViTs perform strongly in image classification and capture global context through attention-based mechanisms. However, their model complexity can obscure the decision-making process, and ambiguous attention maps can lead to erroneous correlations among image patches. The CNN component addresses these challenges by capturing local patterns effectively. The combination makes the model more robust to variations in the input data, such as changes in scale, perspective, or context. The proposed hybrid ViT-CNN model achieves 100 percent accuracy and top-5 accuracy, along with a precision of 93.84 percent. These results surpass the performance of the latest transformer models for rice leaf disease identification.
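
To make the idea of a CNN block inside the transformer layers concrete, the following is a minimal PyTorch sketch of one possible layout. It is an illustration under stated assumptions, not the architecture evaluated in this work: the class names `ConvTransformerBlock` and `HybridViTCNN`, the depthwise 3x3 convolution, its placement between attention and the MLP, mean pooling instead of a class token, and all sizes (`dim`, `depth`, `heads`, `patch`) are hypothetical choices. Only the five output classes come from the text above.

```python
import torch
import torch.nn as nn


class ConvTransformerBlock(nn.Module):
    """Transformer encoder block with an added CNN branch (hypothetical layout)."""

    def __init__(self, dim=64, heads=4, grid=14):
        super().__init__()
        self.grid = grid  # patches per image side; token count is grid * grid
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Assumed CNN block: depthwise 3x3 conv mixes each patch with its
        # spatial neighbours, then a 1x1 conv mixes channels.
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),
        )
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        # Global context: self-attention over all patch tokens.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Local patterns: reshape tokens to a 2D grid and convolve.
        b, n, d = x.shape
        h = self.norm2(x).transpose(1, 2).reshape(b, d, self.grid, self.grid)
        x = x + self.conv(h).flatten(2).transpose(1, 2)
        # Standard position-wise feed-forward sublayer.
        return x + self.mlp(self.norm3(x))


class HybridViTCNN(nn.Module):
    """Patch embedding, a stack of hybrid blocks, and a linear classifier."""

    def __init__(self, num_classes=5, image_size=224, patch=16,
                 dim=64, depth=2, heads=4):
        super().__init__()
        grid = image_size // patch
        # Non-overlapping patches projected to `dim` channels.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learned positional embedding, one vector per patch.
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, dim))
        self.blocks = nn.Sequential(
            *[ConvTransformerBlock(dim, heads, grid) for _ in range(depth)]
        )
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):  # images: (batch, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos
        x = self.blocks(x)
        # Mean-pool the patch tokens (assumed; no class token is used here).
        return self.head(self.norm(x).mean(dim=1))
```

In this sketch, `HybridViTCNN()(torch.randn(2, 3, 224, 224))` returns a tensor of shape `(2, 5)`, one logit per disease class. The convolutional branch is applied as a residual update, so each block keeps the attention path of a standard ViT and adds a locality bias on top of it.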