Skin lesion classification plays a pivotal role in dermatology: it enables early detection and precise diagnosis of skin diseases, which leads to improved patient outcomes. Deep learning has shown great potential for this task because it can learn complex patterns from images. However, diagnostic accuracy is compromised when a method relies exclusively on single-modality images. This work proposes a framework that combines a Vision Transformer model with transfer learning, a channel attention mechanism, and a region of interest (ROI) for the accurate detection of skin conditions, including skin cancer. The approach blends computer vision and machine learning techniques and uses a comprehensive dataset comprising macroscopic dermoscopic images supplemented with patient metadata. Compared with conventional techniques, the proposed methodology shows significant improvements in several metrics, including sensitivity, specificity, and precision. It also performs well on real-world datasets, which reinforces its potential for clinical implementation. With an accuracy of 99%, the method outperforms existing approaches. Overall, this investigation underscores the impact of deep learning and multimodal data analysis in the dermoscopic domain and represents substantial progress in skin lesion diagnosis.

INDEX TERMS Skin lesion classification, dermatology, deep learning, multimodal data analysis, transfer learning, vision transformer.