Machine learning techniques in neuroimaging have prompted researchers to build models for the early diagnosis of brain diseases such as Alzheimer’s disease (AD). Advanced deep-learning (DL) approaches can address this difficult task, but although DL models are effective, they are hard to interpret, time-consuming, and resource-intensive. Neuroscientists are therefore interested in novel, less complex architectures such as transformers, which have superior pattern-extraction capabilities. In this study, an automated framework for accurate AD diagnosis and precise stage identification was developed using vision transformers (ViTs) while requiring fewer computational resources. Unlike convolutional neural networks (CNNs), whose receptive fields are local, a ViT captures global context through its self-attention mechanism. This makes it more efficient than a CNN for brain image processing, because the brain is a highly complex network of interconnected parts. The proposed model was developed on magnetic resonance imaging (MRI) brain scans belonging to four stages and achieved 99.83% detection accuracy, 99.69% sensitivity, 99.88% specificity, and a 0.17% misclassification rate. Moreover, to assess the model's ability to generalize and to understand what it learned from the input MRI image, the mean attention distances of the transformer blocks and the attention heat maps were visualized.
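
For readers who want to see how the four reported metrics relate to one another, the following is a minimal sketch of computing them from a four-class confusion matrix. It is an illustration, not the evaluation code used in this study: the function name `multiclass_metrics`, the macro-averaging of per-class sensitivity and specificity, and the toy confusion matrix are all assumptions introduced here. The one relationship that follows directly from the reported figures is that the misclassification rate is the complement of accuracy (100% − 99.83% = 0.17%).

```python
def multiclass_metrics(confusion):
    """Summarize a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j. Per-class sensitivity and specificity are
    computed one-vs-rest and then macro-averaged (an assumption; the study
    may average differently). Every class must have at least one sample.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n_classes))

    sensitivities, specificities = [], []
    for k in range(n_classes):
        tp = confusion[k][k]                                     # class k predicted as k
        fn = sum(confusion[k]) - tp                              # class k predicted as another class
        fp = sum(confusion[i][k] for i in range(n_classes)) - tp # other classes predicted as k
        tn = total - tp - fn - fp                                # everything else
        sensitivities.append(tp / (tp + fn))
        specificities.append(tn / (tn + fp))

    accuracy = correct / total
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1.0 - accuracy,
        "sensitivity": sum(sensitivities) / n_classes,
        "specificity": sum(specificities) / n_classes,
    }


# Hypothetical counts for four classes (NOT data from this study).
toy_confusion = [
    [48, 2, 0, 0],
    [1, 47, 2, 0],
    [0, 1, 49, 0],
    [0, 0, 0, 50],
]
metrics = multiclass_metrics(toy_confusion)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With a multi-class problem such as four-stage classification, sensitivity and specificity are only defined per class in a one-vs-rest sense, so any single reported figure implies some averaging scheme. Macro-averaging is used here because it weights each stage equally regardless of how many images it contains.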