Data sets used to train learning models for drinking water quality
are often highly imbalanced, which weakens the models' predictive
ability. Although some efforts have been made to address this
imbalance, little is known about which techniques are suitable for
drinking water quality prediction. Here, 16 common learning models
were each applied to a large-scale, highly imbalanced drinking water
quality data set to compare their prediction performance. Our results showed
that ensemble, cost-sensitive learning models, which achieved higher
F1-scores, were more suitable for predicting drinking water quality
than the other models tested in this study. In addition, model
performance could be enhanced by introducing either of two mainstream
sampling algorithms: the synthetic minority oversampling technique
(SMOTE) combined with the Tomek links technique (TLTE) or with the
edited nearest neighbor technique (ENNTE), denoted SMOTE + TLTE and
SMOTE + ENNTE, respectively.
In particular, the F1-scores of deep cascade forest (DCF) with SMOTE
+ TLTE and SMOTE + ENNTE reached 94.54 ± 2.51% and 94.68 ± 2.72%,
respectively. As a consequence, DCF combined with either of these two
sampling algorithms has great potential for drinking water quality
monitoring and prediction, as well as for other fields affected by
imbalanced data.
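
The resampling idea described above can be illustrated with a minimal, dependency-free sketch. It is not the study's implementation: the toy data, the function names (`smote`, `remove_tomek_links`, `f1_score`), the binary-class setting, and the choice to drop both members of each Tomek link are all assumptions made for illustration. The sketch oversamples the minority class by interpolating between minority neighbors, then removes opposite-class pairs that are each other's nearest neighbor, and finally shows why the F1-score is more informative than accuracy on imbalanced labels.

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest(idx, points, candidates, k=1):
    """Indices (drawn from candidates) of the k points closest to points[idx]."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: euclidean(points[idx], points[j]))
    return others[:k]

def smote(X, y, minority, k=3, seed=0):
    """Binary-class SMOTE sketch: add synthetic minority points until balanced."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = (len(y) - len(min_idx)) - len(min_idx)  # majority minus minority
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)
        j = rng.choice(nearest(i, X, min_idx, k))   # a minority neighbor of i
        gap = rng.random()                          # position along segment i -> j
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(minority)
    return X_out, y_out

def remove_tomek_links(X, y):
    """Drop both members of each Tomek link (mutual nearest neighbors of opposite class)."""
    all_idx = range(len(X))
    nn = [nearest(i, X, all_idx)[0] for i in all_idx]
    drop = {i for i in all_idx if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in all_idx if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical toy data: 8 majority samples (label 0), 3 minority samples (label 1).
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.4], [0.5, 0.2], [0.4, 0.5],
     [0.3, 0.3], [0.6, 0.6], [0.7, 0.4], [0.9, 0.9], [1.0, 0.8], [0.8, 1.0]]
y = [0] * 8 + [1] * 3

X_bal, y_bal = smote(X, y, minority=1)             # classes now equal in size
X_clean, y_clean = remove_tomek_links(X_bal, y_bal)

# A classifier that always predicts the majority class looks accurate (8/11)
# on the original labels, yet its F1-score for the minority class is zero.
always_majority = [0] * len(y)
accuracy = sum(t == p for t, p in zip(y, always_majority)) / len(y)
minority_f1 = f1_score(y, always_majority, positive=1)
print(len(X), len(X_bal), len(X_clean), round(accuracy, 3), minority_f1)
```

In practice, resampling of this kind is normally applied only to the training split, so that the evaluation data keep the original class distribution.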