Background: The rising global threat of diabetes demands timely detection to prevent its complications. Data scientists and practitioners have applied AI and other classification models to various aspects of the problem, and the use of machine learning (ML) and AI for early diagnosis has gained attention. However, when missing data and outliers are not addressed, the accuracy of the resulting predictions may be questionable. This study integrates medical knowledge with advanced technology to develop a comprehensive diabetes classification model, focusing on the handling of missing values and outliers to improve accuracy in early disease identification.

Methods: The methodology prioritized careful data pre-processing to improve the quality of the analysis. Missing data were imputed with the missForest method, a multistage imputation process that minimizes data loss and distortion. Outliers were detected using squared Mahalanobis distances. Rather than removing anomalous data points outright, the researchers temporarily replaced them with missing values and then imputed them with missForest, so that outlier handling was folded into the imputation step. The resulting hybrid dataset, free of extreme outliers and completed via missForest, formed the foundation for subsequent analysis and modelling. Model selection and evaluation were performed on the pre-processed data using a two-step validation scheme: an initial partition of the dataset (80% training, 20% testing), followed by ten iterations of ten-fold cross-validation (CV) for model stability and parameter optimization. A diverse array of ML models, including LogitBoost, mlpWeightDecayML, and avNNet, was assessed.
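As a rough illustration of the outlier step described in the Methods, the sketch below flags observations by their squared Mahalanobis distance and replaces them with missing values, leaving them ready for a subsequent imputation pass (missForest in the study; not implemented here). Several details are assumptions rather than facts from the source: the function names, the two illustrative features (glucose and BMI), the toy values, the choice to blank the whole row rather than individual cells, and the cutoff, taken here as the 0.975 chi-square quantile, a common convention that the abstract does not confirm.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows, mean):
    """Sample covariance matrix (divides by n - 1)."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += dev[i] * dev[j] / (n - 1)
    return cov

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    p = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, p))
        x[i] = (aug[i][p] - tail) / aug[i][i]
    return x

def squared_mahalanobis(rows):
    """Squared Mahalanobis distance of every row from the sample mean."""
    mean = mean_vector(rows)
    cov = covariance_matrix(rows, mean)
    distances = []
    for row in rows:
        dev = [x - m for x, m in zip(row, mean)]
        z = solve(cov, dev)  # z = cov^{-1} @ dev, without forming the inverse
        distances.append(sum(d * zi for d, zi in zip(dev, z)))
    return distances

def mask_outliers(rows, cutoff):
    """Replace each row whose squared distance exceeds `cutoff` with NaNs.

    Assumption: the whole row is blanked; the source does not say whether
    rows or individual cells were set to missing.
    """
    d2 = squared_mahalanobis(rows)
    return [
        [math.nan] * len(row) if dist > cutoff else list(row)
        for row, dist in zip(rows, d2)
    ]

# Hypothetical (glucose, BMI) values: a tight cluster plus one extreme record.
rows = [(g, b) for g in (100, 110, 120) for b in (26, 28, 30)]
rows += [(90, 28), (130, 28), (110, 24), (110, 32)]
rows.append((250, 55))  # deliberately extreme observation

# With 2 features the chi-square quantile has a closed form: -2 * ln(1 - p).
cutoff = -2.0 * math.log(1.0 - 0.975)

masked = mask_outliers(rows, cutoff)
n_masked = sum(math.isnan(row[0]) for row in masked)
print(f"cutoff={cutoff:.3f}, rows set to missing: {n_masked}")
```

The masked rows would then be passed, together with any originally missing values, to the imputation method. With more than two features the cutoff would need a general chi-square quantile function, which the Python standard library does not provide.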
Performance was assessed using metrics such as sensitivity, specificity, precision, recall, F1-score, AUC, accuracy, and the Kappa score.

Results: Among the models examined, LogitBoost emerged as a strong contender, with a sensitivity of 0.8095, specificity of 0.9464, precision of 0.85, recall of 0.8095, F1-score of 0.8293, AUC of 0.7888, accuracy of 0.9091, and Kappa score of 0.7674. Performance varied across models and metrics, however: sensitivity ranged from 0.6792 to 0.9057, specificity from 0.6 to 0.9464, and precision from 0.5455 to 0.85.

Conclusions: In summary, a methodical pre-processing approach improved diabetes classification accuracy. By addressing missing values with the missForest method and managing outliers through the hybrid approach, the researchers improved the integrity and quality of the PIMA dataset. This handling of missing values and outliers protected the dataset against potential distortions and led to improved accuracy in diabetes classification. Through careful pre-processing, strategic outlier management, and comprehensive model evaluation, the study contributes valuable insights into early diabetes detection.
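To make the relationship between the reported metrics concrete, the following sketch derives them from a single binary confusion matrix. The function name and the counts are hypothetical: the source does not report a confusion matrix, and these counts were chosen only because they reproduce the rounded LogitBoost values quoted above (multiples of these counts would do so equally well). AUC is left out because it depends on the ranking of predicted scores and cannot be recovered from one confusion matrix.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive common binary classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)  # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / total
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": sensitivity,
        "f1": f1,
        "accuracy": accuracy,
        "kappa": kappa,
    }

# Hypothetical counts, not taken from the source.
metrics = classification_metrics(tp=17, fn=4, fp=3, tn=53)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and recall are the same quantity, which is why the two reported values coincide at 0.8095.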