Background
Metabolic syndrome (MetS) is a prevalent multifactorial disorder that can increase the risk of developing diabetes, cardiovascular diseases, and cancer. We aimed to compare different machine learning classification methods in predicting metabolic syndrome status as well as identifying influential genetic or environmental risk factors.
Methods
This candidate gene study was conducted on 4756 eligible participants from the Tehran Cardio-metabolic Genetic study (TCGS). We compared predictive models using logistic regression (LR), Random Forest (RF), decision tree (DT), support vector machines (SVM), and discriminant analyses. Demographic and clinical features, as well as variables regarding common GCKR gene polymorphisms, were included in the models. We used a 10-repeated tenfold cross-validation to evaluate model performance.
Results
50.6% of participants had MetS. MetS was significantly associated with age, gender, schooling years, BMI, physical activity, rs780094, and rs780093 (P < 0.05) as indicated by LR. RF showed the best performance overall (AUC-ROC = 0.804, AUC-PR = 0.776, and Accuracy = 0.743) and indicated BMI, physical activity, and age to be the most influential model features. According to the DT, a person with BMI < 24 and physical activity < 8.8 possesses a 4% chance for MetS. In contrast, a person with BMI ≥ 25, physical activity < 2.7, and age ≥ 33, has 77% probability of suffering from MetS.
Conclusion
Our findings indicated that, on average, machine learning models outperformed conventional statistical approaches for patient classification. These well-performing models may be used to develop future support systems that use a variety of data sources to identify persons at high risk of getting MetS.