Fungi play a pivotal role in ecosystems and human health, serving both as essential contributors to environmental sustainability and as significant agents of disease. Precise fungal detection underpins effective disease management, agricultural productivity, and global food security. This research explores the efficacy of vision transformer-based architectures for classifying microscopic images of various fungal types, with the aim of improving the detection of fungal infections. The study compares the pre-trained base Vision Transformer (ViT) and Swin Transformer models, evaluating each both as a feature extractor and when fine-tuned. Transfer learning and fine-tuning, particularly when combined with data augmentation, significantly enhance model performance. Evaluated on a comprehensive dataset with and without data augmentation, the Swin Transformer, particularly when fine-tuned, achieves higher accuracy (98.36%) than the ViT model (96.55%). These findings highlight the potential of vision transformer-based models to automate and refine the diagnosis of fungal infections, promising significant advancements in medical image analysis.