Background: Pes planus, commonly known as flatfoot, is a condition in which the medial longitudinal arch of the foot is abnormally low or absent, so that the inner part of the foot has less curvature than normal. Difficulty in recognizing symptoms and errors in diagnosis are common problems in daily practice, so it is important to improve how the diagnosis is made. With the availability of large datasets, deep neural networks have shown promising capabilities in recognizing foot structures and accurately identifying pes planus. Methods: In this study, we developed a novel fusion model that combines the VGG16 convolutional neural network (CNN) with the vision transformer ViT-B/16 to enhance the detection of pes planus. The fusion model leverages the strengths of both the CNN and ViT architectures, resulting in improved performance compared with results reported in the literature. Additionally, ensemble learning techniques were employed to improve the robustness of the model. Results: In 10-fold cross-validation, the model achieved a sensitivity of 97.4%, a specificity of 96.4%, and an F1 score of 96.8%. These results highlight the effectiveness of the proposed model in diagnosing pes planus quickly and accurately, making it suitable for deployment in clinics or healthcare centers. Conclusions: By facilitating early diagnosis, the model can contribute to better management of treatment processes, ultimately leading to an improved quality of life for patients.
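
For readers who want to reproduce the reported metrics, the following is a minimal sketch of how sensitivity, specificity, and F1 score can be computed from per-fold confusion-matrix counts and then averaged across folds. It is not the study's evaluation code. The names `FoldCounts`, `summarize`, and `example_folds` are hypothetical, the counts are made-up illustrative values rather than study data, and averaging per fold (instead of pooling counts over all folds) is an assumption, since the abstract does not state which was used. Pes planus is treated as the positive class.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FoldCounts:
    """Confusion-matrix counts for one validation fold (pes planus = positive)."""
    tp: int  # flatfoot images correctly flagged
    fp: int  # normal images incorrectly flagged
    tn: int  # normal images correctly cleared
    fn: int  # flatfoot images missed


def sensitivity(c: FoldCounts) -> float:
    # Share of true pes planus cases that the model detects: TP / (TP + FN).
    return c.tp / (c.tp + c.fn)


def specificity(c: FoldCounts) -> float:
    # Share of normal feet that the model correctly clears: TN / (TN + FP).
    return c.tn / (c.tn + c.fp)


def f1_score(c: FoldCounts) -> float:
    # Harmonic mean of precision and sensitivity: 2TP / (2TP + FP + FN).
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


def summarize(folds: list[FoldCounts]) -> dict[str, float]:
    # Assumption: metrics are computed per fold and then averaged.
    # Each fold is assumed to contain both classes, so no denominator is zero.
    return {
        "sensitivity": mean(sensitivity(c) for c in folds),
        "specificity": mean(specificity(c) for c in folds),
        "f1": mean(f1_score(c) for c in folds),
    }


# Hypothetical counts for two folds, for illustration only (not study data).
example_folds = [
    FoldCounts(tp=48, fp=2, tn=48, fn=2),
    FoldCounts(tp=49, fp=1, tn=49, fn=1),
]
summary = summarize(example_folds)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

In a full 10-fold evaluation, the list would hold ten `FoldCounts` entries, one per held-out fold. Pooling the counts before computing the metrics is an equally common choice and can give slightly different values when fold sizes or class balance differ.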