Background:
Although machine learning (ML)-based prediction of coronary
artery disease (CAD) has gained increasing attention, assessment of the severity
of suspected CAD in symptomatic patients remains challenging.
Methods:
The training set for this study consisted of 284 retrospectively enrolled participants,
while the test set included 116 prospectively enrolled participants from whom we
collected 53 baseline variables and coronary angiography results. The data were
pre-processed with outlier handling and one-hot encoding. In the first stage, we
constructed a binary ML classification model that used baseline information to
predict the presence of CAD. In the second stage, baseline information was used
to construct ML regression models for predicting the severity of CAD. The non-CAD
population was included in these models, and two different scores (SYNTAX and
GENSINI) were used as output variables.
Finally, statistical analysis and SHAP plot visualization methods were employed
to explore the relationship between baseline information and CAD.
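The pre-processing step can be illustrated with a minimal sketch. The abstract does not state which outlier rule was applied, so the interquartile-range clipping below is an assumption, and the function names and toy values are hypothetical rather than taken from the study.

```python
from statistics import quantiles


def clip_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, not from the study)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


def one_hot(categories):
    """Return the sorted category levels and one 0/1 indicator row per input value."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


# Hypothetical toy values, not study data.
ldl = [2.1, 2.4, 2.6, 2.8, 3.0, 3.1, 9.9]
clipped = clip_outliers_iqr(ldl)  # only the extreme last value is pulled in

smoking = ["never", "current", "former", "never"]
levels, encoded = one_hot(smoking)
```

Clipping keeps every participant in the data set while limiting the influence of extreme measurements; dropping outlying rows would be an alternative choice with a different trade-off.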
Results:
The study included 269 CAD patients and 131 healthy controls.
The eXtreme Gradient Boosting (XGBoost) model exhibited the best performance
amongst the different models for predicting CAD, with an area under the receiver
operating characteristic curve of 0.728 (95% CI 0.623–0.824). The main
correlates were left ventricular ejection fraction, homocysteine, and hemoglobin
(p < 0.001). The XGBoost model also performed best for predicting the
SYNTAX score, with the main correlates being brain natriuretic peptide (BNP),
left ventricular ejection fraction, and glycated hemoglobin (p < 0.001). The
main relevant features in the model predicting the GENSINI score
were BNP, high density lipoprotein, and homocysteine (p < 0.001).
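For readers unfamiliar with how an area under the curve and its confidence interval can be obtained, the following is a minimal sketch. The abstract does not say how the 95% CI was computed; the percentile bootstrap used here is an assumption, and the function names and toy labels and scores are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, not from the study)."""
    rng = random.Random(seed)
    indices = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # a resample needs both classes to define an AUC
            stats.append(auc(ys, [scores[i] for i in sample]))
    stats.sort()
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical toy predictions, not study data (1 = CAD, 0 = non-CAD).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]
point = auc(labels, scores)
lower, upper = bootstrap_auc_ci(labels, scores)
```

With a test set of limited size, the interval is expected to be fairly wide, which is in line with the reported range of 0.623–0.824.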
Conclusions:
This data-driven approach provides a foundation for the
risk stratification and severity assessment of CAD.
Clinical Trial Registration:
The study was registered in the www.clinicaltrials.gov protocol registration system (number NCT05018715).