Objective
To analyze and predict the distribution of high-risk areas for schistosomiasis using information value (IV) and machine learning models, and thereby provide scientific evidence for disease surveillance and control in China.
Methods
The distribution of local cases in schistosomiasis surveillance data from China between 2005 and 2019 was assessed against 19 variables, including climatic, geographic, and socioeconomic factors. Seven models in three categories were built: the IV model alone; three machine learning models (logistic regression, LR; random forest, RF; generalized boosted model, GBM); and three coupled models that combine IV with each machine learning model (IV + LR, IV + RF, and IV + GBM). Accuracy, area under the curve (AUC), and F1-score were used to evaluate the predictive performance of the models. The best-performing model was then used to predict the risk distribution of schistosomiasis.
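As an illustration of the IV step, the following is a minimal sketch that assumes the formulation commonly used in geospatial risk mapping, where the information value of class i of a variable is ln((N_i / N) / (S_i / S)), with N_i the number of case cells in the class, N the total number of case cells, S_i the number of grid cells in the class, and S the total number of grid cells. The abstract does not state this formula; the formula, the function name `information_values`, the `floor` parameter, the toy data, and the coupling step (replacing class labels with their IV before fitting LR, RF, or GBM) are all assumptions made for illustration.

```python
import math
from collections import Counter

def information_values(classes, has_case, floor=-5.0):
    """Information value of each class of one (binned) variable.

    classes  : class label of every grid cell for this variable
    has_case : 1 if the cell contains a local case, otherwise 0
    floor    : hypothetical value assigned to classes with no cases,
               where ln(0) would be undefined
    """
    total_cells = len(classes)
    total_cases = sum(has_case)
    cells_per_class = Counter(classes)
    cases_per_class = Counter(c for c, y in zip(classes, has_case) if y)

    iv = {}
    for label, n_cells in cells_per_class.items():
        n_cases = cases_per_class.get(label, 0)
        if n_cases == 0:
            iv[label] = floor
            continue
        case_share = n_cases / total_cases   # N_i / N
        area_share = n_cells / total_cells   # S_i / S
        iv[label] = math.log(case_share / area_share)
    return iv

# Toy data (invented): eight grid cells, one variable with two classes.
classes = ["a", "a", "a", "a", "b", "b", "b", "b"]
has_case = [1, 0, 0, 0, 1, 1, 1, 0]
iv = information_values(classes, has_case)

# Coupling step (assumed): replace each class label by its IV so the
# values can serve as numeric features for LR, RF, or GBM.
iv_feature = [iv[c] for c in classes]
```

A positive IV marks a class in which cases are over-represented relative to its share of the study area, and a negative IV marks the opposite. Under this reading, the standalone IV model would sum the values across all 19 variables for each cell.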
Results
IV + GBM achieved the best predictive performance (accuracy = 0.878, AUC = 0.902, F1 = 0.920). According to IV + GBM, the area at risk of transmission comprised 4.66% of China and was mainly distributed in the coastal regions of the middle and lower reaches of the Yangtze River, the Poyang Lake region, and the Dongting Lake region. Risk areas were classified as low-risk (2.47%), medium-risk (1.35%), and high-risk (0.84%). High-risk areas were primarily distributed in eastern Changde, western Yueyang, northeastern Yiyang, and central Changsha in Hunan Province; southern Jiujiang, northern Nanchang, northeastern Shangrao, and eastern Yichun in Jiangxi Province; southern Jingzhou, southern Xiantao, and central Wuhan in Hubei Province; southern Anqing, northwestern Guichi, and eastern Wuhu in Anhui Province; and central Meishan, northern Leshan, and central Liangshan in Sichuan Province.
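For readers less familiar with the three evaluation metrics, the sketch below shows how accuracy, F1-score, and AUC can be computed from binary labels and predicted risk scores. It is a plain-Python illustration, not the authors' evaluation code; the function names, the 0.5 decision threshold, and the toy data are assumptions.

```python
def accuracy(y_true, y_pred):
    """Share of cells whose predicted class matches the observed class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """Probability that a random positive cell outscores a random
    negative cell (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (invented): 1 = cell with local cases, 0 = cell without.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
area = auc(y_true, scores)
```

Accuracy and F1 depend on the chosen threshold, whereas AUC is computed from the ranking of scores alone, which is one reason the three are typically reported together.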
Conclusions
The risk of schistosomiasis transmission persists in China, with high-risk areas relatively concentrated in specific regions. Coupled models of IV and machine learning enable effective analysis and prediction of risk and provide a scientific basis for surveillance and control in key areas.