In this study, we present an approach for classifying speech recordings into distinct health categories (Parkinson's disease, multiple sclerosis (MS), healthy, and other) using a transformer-based neural network. The core of the approach is the conversion of human speech into spectrograms, which are then rendered as images and passed to the network. This representation lets the network capture vocal patterns and subtle variations that are indicative of different health conditions. Experimental validation shows that the approach achieves high accuracy in differentiating Parkinson's disease, MS, healthy subjects, and other categories. These results point to potential clinical applications as a non-invasive diagnostic tool that combines spectrogram analysis with transformer-based models.
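The speech-to-image step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `spectrogram_image`, the Hann window, the frame length of 400 samples, the hop of 160 samples, and the 80 dB dynamic range are all assumptions chosen for illustration, and the transformer classifier itself is not shown.

```python
import numpy as np


def spectrogram_image(signal, frame_len=400, hop=160, floor_db=-80.0):
    """Return a log-magnitude spectrogram as a uint8 image (frequency x time).

    All parameter defaults are illustrative assumptions, not values from the study.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Slice the signal into overlapping frames and apply a Hann window.
    n_frames = 1 + (signal.size - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)

    # Magnitude of the one-sided FFT of each frame.
    mag = np.abs(np.fft.rfft(frames, axis=1))

    # Convert to decibels relative to the peak and clip at floor_db.
    peak = max(mag.max(), 1e-10)
    db = 20.0 * np.log10(np.maximum(mag, 1e-10) / peak)
    db = np.clip(db, floor_db, 0.0)

    # Scale to 0..255, then orient the image with low frequencies at the bottom.
    img = np.round((db - floor_db) / -floor_db * 255.0).astype(np.uint8)
    return img.T[::-1]
```

For a one-second recording sampled at 16 kHz, these assumed settings yield a 201 x 98 grayscale image, which could then be resized or split into patches before being passed to a transformer-based model.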