2021
DOI: 10.3389/fenvs.2021.701288

Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water

Abstract: Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To…

Cited by 14 publications (6 citation statements)
References 48 publications (101 reference statements)
“…Since growing environments are complex, and past studies have shown that environmental factors can interact to affect odds of pathogen detection, conditional forest analysis was used to characterize hierarchical associations between odds of Salmonella, L, LM, and LS isolation and environmental factors using the mlr and party packages, as previously described (71,72). Briefly, a 3-fold cross-validation repeated 10 times was used to tune the number of variables considered in each split, to determine the maximum depth of the tree, and to assess model fit.…”
Section: Methods | Citation type: mentioning
confidence: 99%
“…Briefly, a 3-fold cross-validation repeated 10 times was used to tune the number of variables considered in each split, to determine the maximum depth of the tree, and to assess model fit. The synthetic minority oversampling technique (SMOTE) for resampling was used to address class imbalance in the outcomes (72). The relative strength of associations between each covariate and odds of isolating the microbial target were quantified using conditional variable importance.…”
Section: Methods | Citation type: mentioning
confidence: 99%
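
For readers unfamiliar with SMOTE, the core idea named in the statement above can be sketched in a few lines: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is a minimal illustration only, not the implementation used by the cited studies (which worked in R with the mlr and party packages); the function name `smote`, its parameters, and the toy data are assumptions made for this example.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority
    samples and their k nearest minority-class neighbours.

    Hypothetical helper for illustration; not taken from the cited papers.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of `base` (excluding itself).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sq_dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it always falls within the range of the observed minority data.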
“…Since the error of prediction in the minority class is dominated by the error of prediction in the majority class due to the data size difference, the accuracy of prediction in the minority class is obviously poorer than the accuracy of the majority class. The resampling method has been proven to be a feasible solution to this problem [49]. In this paper, we propose a method that can be used to select additional data from the minority class.…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…Briefly, hyperparameters were tuned, models trained, and performance assessed using three-fold cross-validation repeated 10 times; cross-validation was performed to optimize R-squared (for continuous outcomes) and area under the curve (AUC; for binary outcomes). For imbalanced, binary outcomes (i.e., Giardia detection versus non-detection), SMOTE resampling was performed as part of tuning (Kuhn and Johnson, 2016;Weller et al, 2021).…”
Section: Discussion | Citation type: mentioning
confidence: 99%
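
The "three-fold cross-validation repeated 10 times" scheme quoted above can be sketched as an index generator. The version below is a minimal, assumption-laden illustration: the stratification by class, the function name `repeated_stratified_folds`, and the toy labels are choices made for this example and are not stated in the quoted text. When resampling is done as part of tuning, it would normally be applied to the training indices of each split only, so the held-out fold keeps its original class balance.

```python
import random
from collections import defaultdict


def repeated_stratified_folds(labels, n_splits=3, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) tuples.

    Hypothetical helper for illustration: each repeat reshuffles the data
    and deals the indices of every class round-robin into n_splits folds,
    so class proportions stay roughly equal across folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    for rep in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for idxs in by_class.values():
            shuffled = idxs[:]
            rng.shuffle(shuffled)
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for f in range(n_splits):
            test_idx = sorted(folds[f])
            train_idx = sorted(
                i for g in range(n_splits) if g != f for i in folds[g]
            )
            # Any resampling step (for example SMOTE) would be fitted on
            # train_idx only; test_idx is left untouched for evaluation.
            yield rep, f, train_idx, test_idx


# Toy imbalanced outcome: 6 detections (1) and 24 non-detections (0).
labels = [1] * 6 + [0] * 24
splits = list(repeated_stratified_folds(labels))
```

With the defaults this yields 30 train/test splits, and performance metrics such as AUC would be averaged over them.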