2022
DOI: 10.1101/2022.04.06.487300
Preprint

Thresholding Gini Variable Importance with a single trained Random Forest: An Empirical Bayes Approach

Abstract: Random Forests (RF) are a very widely used modelling tool. Lundberg et al. (2019) conclude that no nonlinear model had more widespread popularity, from health care to academia to industry, than random forests and decision trees. The bounds of the methodology are still being extended; Bayat et al. (2020) give an example with 80 million variables. It is highly desirable that RF models be made more interpretable, and a large part of that is a better understanding of the characteristics of the variable importance…
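For readers unfamiliar with the quantity being thresholded, the following is a minimal, illustrative sketch (not the paper's implementation, which targets far larger genomic data) of obtaining Gini, i.e. mean-decrease-in-impurity, variable importance from a single trained Random Forest with scikit-learn; the synthetic dataset and parameters are assumptions for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a feature matrix with few truly informative variables.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Mean decrease in Gini impurity, aggregated across all trees in the forest.
gini_importance = rf.feature_importances_
top = np.argsort(gini_importance)[::-1][:10]
print(top)
print(gini_importance[top])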


Cited by 3 publications (4 citation statements) | References 54 publications
“…Due to the size of the data, Variant Spark, a cloud-based implementation of RF, was used instead as it has been shown to be more efficient than ranger [8]. Another benefit of Variant Spark is the implemented Local FDR approach that calculates threshold values for features [11], allowing us to select SNPs based on their involvement in tree building and significance.…”
Section: Random Forest Algorithm and Feature Selection
confidence: 99%
“…While this score can rank variants by importance, it is unable to determine significantly associated variants. To determine significance from importance scores, we used a recently developed method [22]. Briefly, this approach is based on the empirical Bayes method [56] which uses RF tree information as a threshold to fit a skew normal distribution and correct for multiple testing akin to Efron's local false discovery rate approach.…”
Section: P-value Calculation
confidence: 99%
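To make the description in that citation concrete, here is a rough, hypothetical sketch of local-FDR-style thresholding of importance scores: it fits a skew normal as the null density and a kernel density estimate as the marginal. This only approximates the idea reported above and is not the authors' actual procedure; the function name, pi0 assumption, and log transform are illustrative choices.

import numpy as np
from scipy.stats import skewnorm, gaussian_kde

def local_fdr_from_importances(importances, pi0=1.0, eps=1e-12):
    # Hypothetical helper, not the published method: assume most features are
    # null, fit a skew normal to log importance scores as the null density f0,
    # estimate the marginal density f with a Gaussian KDE, and return
    # lfdr(x) = pi0 * f0(x) / f(x), clipped to [0, 1].
    z = np.log(np.asarray(importances) + eps)
    a, loc, scale = skewnorm.fit(z)           # null component f0
    f0 = skewnorm.pdf(z, a, loc, scale)
    f = gaussian_kde(z)(z)                    # marginal density f
    return np.clip(pi0 * f0 / np.maximum(f, eps), 0.0, 1.0)

# Features whose local FDR falls below a chosen cutoff (e.g. 0.05) would then
# be treated as significantly associated.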
“…21 Alzheimer's Association, Illinois, USA. 22 University Of Pittsburgh, Pennsylvania, USA. 23 Cornell University, New York, USA.…”
Section: Data Availability Statement
confidence: 99%