2021
DOI: 10.1021/acsestwater.1c00037

Identification of Suitable Technologies for Drinking Water Quality Prediction: A Comparative Study of Traditional, Ensemble, Cost-Sensitive, Outlier Detection Learning Models and Sampling Algorithms

Abstract: Drinking water quality data sets used in learning models have been highly imbalanced, which has weakened the prediction ability of models for drinking water quality. Although some efforts have been made to address the issue of imbalance, little is known about the suitable technologies for drinking water quality prediction. Here, a total of 16 common learning models were applied individually to compare the drinking water quality prediction performance based on a large-scale highly imbalanced drinking water qual…

Cited by 7 publications (5 citation statements). References 40 publications.
“…The improvement over the baseline performance was obtained by changing only a single hyperparameter of the base learners of the ANN-EFS: the cost function. This is consistent with findings from several other fields, including inventory management [41], flood modelling [42,49], fraud detection [50], epidemiology [52], and drinking water quality modelling [54,55], all of which have shown that changing the error metric and implementing cost-sensitive learning is much more effective than using standard symmetrical error metrics and cost-insensitive learning. However, the present study shows this for the first time when using probabilistic EFS.…”
Section: PLOS Water (supporting, confidence: 89%)
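The cost-sensitive learning described in this statement boils down to penalising errors asymmetrically: misclassifying a rare "unqualified water" event costs more than a false alarm. As a minimal illustration (not the actual cost function of the cited ANN-EFS study, whose base learners and weights are not given here), a class-weighted cross-entropy can be written as:

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, w_pos=10.0, w_neg=1.0):
    """Cost-sensitive cross-entropy: errors on the rare positive class
    (y=1, e.g. unqualified drinking water) are penalised w_pos/w_neg
    times more heavily than errors on the majority class. The weights
    here are illustrative, not taken from the cited study."""
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)  # avoid log(0)
    return float(np.mean(-w_pos * y_true * np.log(p)
                         - w_neg * (1 - y_true) * np.log(1 - p)))

# A missed positive (true 1 predicted at p=0.1) costs 10x more than a
# symmetric-looking error on a negative (true 0 predicted at p=0.9).
miss_positive = weighted_log_loss(np.array([1.0]), np.array([0.1]))
miss_negative = weighted_log_loss(np.array([0.0]), np.array([0.9]))
```

In practice the same effect is often obtained without a custom loss, e.g. via the `class_weight` parameter of scikit-learn classifiers.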
“…In drinking water research, two recent studies have proposed ANN model building frameworks [80,81]; however, neither of these studies includes guidance on the selection of an appropriate cost function. Based on the substantial improvements in performance obtained in this study when training the ANN-EFS with the selected cost function as opposed to a default, as well as the improvements over cost-insensitive training obtained in other drinking water studies [54,55], we recommend that future model development frameworks for drinking water modelling should also include consideration of the selection of an appropriate cost function.…”
Section: PLOS Water (mentioning, confidence: 89%)
“…It is only after such an effort that we can build more meaningful models to address the research questions mentioned above. Second, when modeling drinking water quality in China based on six important indicators (pH, electric conductivity, turbidity, spectral absorption coefficient, water temperature, and pulse-frequency-modulation value of the sensor panel), Chen et al. [8] discovered that, although there is a large amount of data on drinking water quality, the data sets are highly imbalanced; that is, the majority of the treated drinking water meets the requirements and there is a rare occurrence (1.79‰) of unqualified drinking water events. To still develop robust ML models, the authors reduced the degree of data set imbalance by employing different combinations of mixed sampling algorithms: the synthetic minority oversampling technique (SMOTE), the Tomek links technique (TLTE), and the edited nearest neighbor technique (ENNTE).…”
(mentioning, confidence: 99%)
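The core idea of SMOTE mentioned in this statement is to synthesise new minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal numpy sketch of that interpolation step only, not the implementation used by the cited study (which combined SMOTE with Tomek-link and edited-nearest-neighbor undersampling, e.g. as offered by imbalanced-learn's `SMOTETomek` and `SMOTEENN`):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE sketch: each synthetic sample is a random convex
    combination of a minority point and one of its k nearest minority
    neighbours. X_min holds only minority-class rows."""
    rng = np.random.default_rng(rng)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from the chosen point to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(nbrs)
        lam = rng.random()              # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled set stays inside the minority class's convex hull; the Tomek-link and ENN cleanup steps then remove borderline or noisy majority samples.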