2018
DOI: 10.1371/journal.pone.0201904
On the overestimation of random forest’s out-of-bag error

Abstract: The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the tru…
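The abstract describes using the out-of-bag (OOB) error to tune mtry. A minimal sketch of that workflow with scikit-learn, where mtry corresponds to the `max_features` parameter (the dataset and parameter grid here are illustrative, not from the paper):

```python
# Tuning mtry (scikit-learn's `max_features`) by minimizing the OOB error.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem with metric predictors.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

oob_errors = {}
for mtry in (2, 4, 8, 16):
    rf = RandomForestClassifier(
        n_estimators=500,
        max_features=mtry,   # number of candidate predictors drawn per split
        oob_score=True,      # evaluate each tree on its out-of-bag samples
        random_state=0,
    ).fit(X, y)
    oob_errors[mtry] = 1.0 - rf.oob_score_  # OOB error = 1 - OOB accuracy

best_mtry = min(oob_errors, key=oob_errors.get)
```

The paper's point is that the OOB error selected this way can be biased for the true prediction error, so the chosen mtry should ideally be validated on independent data.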

Cited by 192 publications (109 citation statements)
References 39 publications
“…Although the two-step random forest model showed good performance, especially for Spain (Pearson's correlation coefficient of 0.90 for reservoir-based and 0.98 for run-of-river generation from 5-fold cross-validation), this validation method usually gives optimistic results [16]. One possible explanation is that despite being randomly permuted, the validation dataset in each cross-validation is still an average of multiple random samples from the same dataset, and thus the prediction is not totally independent of the training set.…”
Section: Validation With Independent Datasets
confidence: 99%
“…To evaluate the accuracy for the Random Forest models, we used the out-of-bag training accuracy. While not strictly equivalent to explicit cross-validation, the accuracy metric provided by out-of-bag training accuracy is a sufficient proxy, even though for unbalanced datasets such as ours, the out-of-bag training accuracy underestimates the error rate (Janitza and Hornung, 2018); however, from our own experience, this underestimation is not significant. For precision, the classic definition of precision can be applied for sets of inputs on known label and their model-generated labeling.…”
Section: Evaluation Of Lipid Classification Performance
confidence: 63%
“…While not strictly equivalent to explicit cross-validation, the accuracy metric provided by out-of-bag training accuracy is a sufficient proxy. In fact, for unbalanced datasets such as ours, the out-of-bag training accuracy can underestimate the error rate [63]. Therefore, the use of k-fold cross-validation with Random Forest would only reduce training sample size and could substantially overestimate the error.…”
Section: Evaluation Of Lipid Classification Performance
confidence: 94%
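The excerpts above contrast the OOB error with explicit k-fold cross-validation, including on unbalanced data. A minimal sketch of how the two estimates can be computed side by side (synthetic, mildly imbalanced data; not the cited authors' setup):

```python
# Comparing the OOB error estimate with a stratified 5-fold CV estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem with an 80/20 class imbalance.
X, y = make_classification(
    n_samples=600, n_features=15, weights=[0.8, 0.2], random_state=1
)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=1)

# OOB error: each tree is scored on the samples left out of its bootstrap.
oob_error = 1.0 - rf.fit(X, y).oob_score_

# Explicit cross-validation error on the same model and data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_error = 1.0 - cross_val_score(rf, X, y, cv=cv).mean()
```

Whether the OOB estimate over- or underestimates the CV estimate depends on the class balance and the data, which is exactly the discrepancy the quoted statements (and the paper itself) discuss.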