Multimodal neuroimaging, which combines data from multiple imaging modalities, has shown promise in classifying the stages of Alzheimer's disease (AD). However, existing multimodal neuroimaging fusion methods have limitations in objective performance, sensitivity, and specificity for AD classification. This study employs a Pareto-optimal cosine color map to improve both classification performance and the visual clarity of fused images. A mobile vision transformer (ViT) model incorporating the swish activation function is introduced for effective feature extraction and classification. Fused images from the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Whole Brain Atlas (AANLIB), and the Open Access Series of Imaging Studies (OASIS) datasets, obtained through optimized transposed convolution, are used for model training, while evaluation is performed on unfused images from the same databases. The proposed model achieves high accuracy in AD classification across datasets: on ADNI, 98.76% for Early Mild Cognitive Impairment (EMCI) versus Late Mild Cognitive Impairment (LMCI), 98.65% for LMCI versus AD, 98.60% for EMCI versus AD, and 99.25% for AD versus Cognitively Normal (CN). Similarly, the precision of the AD versus CN classification on the OASIS and AANLIB datasets is 99.50% and 96.00%, respectively. Precision, recall, and F1-score results for the various binary classification tasks further demonstrate the model's robust performance.
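To illustrate the general idea of a cosine color map applied to a fused grayscale image, the minimal sketch below maps normalized intensities to RGB with a per-channel cosine palette. The Pareto-optimal coefficients reported in the study are not reproduced here; the parameters `a`, `b`, `c`, `d` and the random input image are hypothetical placeholders used only to show the mapping.

```python
import numpy as np

def cosine_colormap(intensity, a, b, c, d):
    """Map normalized intensities in [0, 1] to RGB using a cosine palette:
    color(t) = a + b * cos(2*pi*(c*t + d)), evaluated per channel.
    The study's Pareto-optimal parameters are not given here; a, b, c, d
    are illustrative length-3 arrays."""
    t = intensity[..., None]                              # shape (..., 1)
    return np.clip(a + b * np.cos(2 * np.pi * (c * t + d)), 0.0, 1.0)

# Hypothetical palette coefficients, for illustration only
a = np.array([0.5, 0.5, 0.5])
b = np.array([0.5, 0.5, 0.5])
c = np.array([1.0, 1.0, 1.0])
d = np.array([0.00, 0.33, 0.67])

fused_slice = np.random.rand(128, 128)                    # stand-in for a fused image
rgb = cosine_colormap(fused_slice, a, b, c, d)            # shape (128, 128, 3)
```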
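For reference, the swish activation is defined as f(x) = x * sigmoid(x) (available in PyTorch as SiLU). The sketch below shows a minimal transformer feed-forward block using swish; the layer dimensions and block structure are illustrative assumptions, not the study's exact mobile ViT configuration.

```python
import torch
import torch.nn as nn

class SwishFeedForward(nn.Module):
    """Illustrative transformer feed-forward block using the swish (SiLU)
    activation, x * sigmoid(x). Dimensions are placeholder values, not the
    configuration used in the study."""
    def __init__(self, dim=144, hidden_dim=288):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.SiLU(),                      # swish activation: x * sigmoid(x)
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

tokens = torch.randn(1, 64, 144)            # (batch, tokens, dim) example input
out = SwishFeedForward()(tokens)            # same shape as the input
```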