2018
DOI: 10.1371/journal.pone.0201904
On the overestimation of random forest’s out-of-bag error

Abstract: The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the tru…
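The abstract describes using the out-of-bag (OOB) error to tune mtry. A minimal sketch of that workflow with scikit-learn, where mtry corresponds to the `max_features` parameter (the dataset and parameter grid here are illustrative, not from the paper):

```python
# Tuning mtry (scikit-learn's `max_features`) by minimizing the OOB error.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem with metric predictors.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

oob_errors = {}
for mtry in (2, 4, 8, 16):
    rf = RandomForestClassifier(
        n_estimators=500,
        max_features=mtry,   # number of candidate predictors drawn per split
        oob_score=True,      # evaluate each tree on its out-of-bag samples
        random_state=0,
    ).fit(X, y)
    oob_errors[mtry] = 1.0 - rf.oob_score_  # OOB error = 1 - OOB accuracy

best_mtry = min(oob_errors, key=oob_errors.get)
```

The paper's point is that the OOB error selected this way can be biased for the true prediction error, so the chosen mtry should ideally be validated on independent data.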

Cited by 192 publications (109 citation statements)
References 39 publications
“…Although the two-step random forest model showed good performance, especially for Spain (Pearson's correlation coefficient of 0.90 for reservoir-based and 0.98 for run-of-river generation from 5-fold cross-validation), this validation method usually gives optimistic results [16]. One possible explanation is that despite being randomly permuted, the validation dataset in each cross-validation is still an average of multiple random samples from the same dataset, and thus the prediction is not totally independent of the training set.…”
Section: Validation With Independent Datasets
confidence: 99%
“…To evaluate the accuracy for the Random Forest models, we used the out-of-bag training accuracy. While not strictly equivalent to explicit cross-validation, the accuracy metric provided by out-of-bag training accuracy is a sufficient proxy, even though for unbalanced datasets such as ours, the out-of-bag training accuracy underestimates the error rate (Janitza and Hornung, 2018); however, from our own experience, this underestimation is not significant. For precision, the classic definition of precision can be applied for sets of inputs on known label and their model-generated labeling.…”
Section: Evaluation Of Lipid Classification Performance
confidence: 63%
“…While not strictly equivalent to explicit cross-validation, the accuracy metric provided by out-of-bag training accuracy is a sufficient proxy. In fact, for unbalanced datasets such as ours, the out-of-bag training accuracy can underestimate the error rate [63]. Therefore, the use of k-fold cross-validation with Random Forest would only reduce training sample size and could substantially overestimate the error.…”
Section: Evaluation Of Lipid Classification Performance
confidence: 94%
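The excerpts above contrast the OOB error with explicit k-fold cross-validation, including on unbalanced data. A minimal sketch of how the two estimates can be computed side by side (synthetic, mildly imbalanced data; not the cited authors' setup):

```python
# Comparing the OOB error estimate with a stratified 5-fold CV estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem with an 80/20 class imbalance.
X, y = make_classification(
    n_samples=600, n_features=15, weights=[0.8, 0.2], random_state=1
)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=1)

# OOB error: each tree is scored on the samples left out of its bootstrap.
oob_error = 1.0 - rf.fit(X, y).oob_score_

# Explicit cross-validation error on the same model and data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_error = 1.0 - cross_val_score(rf, X, y, cv=cv).mean()
```

Whether the OOB estimate over- or underestimates the CV estimate depends on the class balance and the data, which is exactly the discrepancy the quoted statements (and the paper itself) discuss.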