2018
DOI: 10.1186/s12859-018-2264-5

Random forest versus logistic regression: a large-scale benchmark experiment

Abstract: Background and goal: The Random Forest (RF) algorithm for regression and classification has considerably gained popularity since its introduction in 2001. Meanwhile, it has grown to a standard classification approach competing with logistic regression in many innovation-friendly scientific fields. Results: In this context, we present a large scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classificat…

Citations: cited by 583 publications (425 citation statements)
References: 35 publications (48 reference statements)

“…Although advanced analytic methods and traditional regression models have comparable discrimination, model performance is often influenced by both the size of the cohort under study and the number of events per variable (EPV). Evidence suggests that logistic regression models perform better (in terms of accuracy, parsimony and/or discrimination) in smaller datasets with approximately 20‐50 EPV, while random forest models perform well with larger sample sizes and achieve sufficient stability when EPV exceeds 200. In addition, the number and type of variables selected as predictors affect model performance.…”
Section: Discussion (mentioning)
confidence: 99%
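
The EPV figure quoted in this statement can be computed directly from a dataset's outcome counts. The following is a minimal sketch, assuming the common convention that "events" means the number of cases in the less frequent outcome class and "variables" means the number of candidate predictors; the function names and the example numbers are illustrative and do not come from the cited papers.

```python
def events_per_variable(n_events: int, n_predictors: int) -> float:
    """Return EPV = outcome events / candidate predictor variables."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return n_events / n_predictors


def epv_from_labels(labels, n_predictors: int) -> float:
    """Compute EPV from binary labels (0/1).

    Assumption: the event count is the size of the less frequent class.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return events_per_variable(min(positives, negatives), n_predictors)


# Hypothetical cohort: 1000 records, 120 events, 6 candidate predictors.
labels = [1] * 120 + [0] * 880
print(epv_from_labels(labels, n_predictors=6))
```

For this hypothetical cohort the EPV sits at the lower end of the 20‐50 range that the statement associates with logistic regression, and well below the 200 it mentions for random forest stability.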
“…In addition, the number and type of variables selected as predictors affect model performance. Previous research has demonstrated that random forest models achieve high performance not only as more variables are selected, but also when a large number of continuous variables are used as predictors. Given these considerations, data composition, quality, and completeness should be of the utmost importance when selecting or merging clinical data sources for prediction modeling.…”
Section: Discussion (mentioning)
confidence: 99%
“…Random forest, a tree-based ensemble algorithm, is a widely used machine-learning algorithm [23][24][25] and has proven to be a robust machine-learning method that is popular in many applications. [26][27][28][29] Pharmacokinetic (nivolumab clearance) and pharmacodynamic (inflammatory cytokine panel) data from 468 patients with melanoma treated with nivolumab in two phase III studies (CheckMate 066 and 067) were used as training data sets. Summary statistics of cytokines used in the development of the machine-learning model are provided in Table S1.…”
Section: Machine-learning Model Development (mentioning)
confidence: 99%
“…In the next sections, we use these chosen variables within a credit risk model and assess their effect on the prediction of default. We do so both by using a standard logistic regression approach, which transparently quantifies the effect of each individual predictor on credit risk and which is traditionally used in the credit risk industry, and a random-forest approach (Breiman, 2001), which is a more flexible machine learning method based on an ensemble of decision trees which has been found to perform well in many classification tasks and under various benchmark studies (Couronné et al, 2018). Table 4 shows the results of a logistic regression analysis.…”
Section: Network-based Predictors Of the Credit Risk Of Small And Med… (mentioning)
confidence: 99%