2014
DOI: 10.1002/bimj.201300077

Probability estimation with machine learning methods for dichotomous and multicategory outcome: Applications

Abstract: Machine learning methods are applied to three different large datasets, all dealing with probability estimation problems for dichotomous or multicategory data. Specifically, we investigate k-nearest neighbors, bagged nearest neighbors, random forests for probability estimation trees, and support vector machines with the kernels of Bessel, linear, Laplacian, and radial basis type. Comparisons are made with logistic regression. The dataset from the German Stroke Study Collaboration with dichotomous and three-cat…
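The abstract pits non-parametric probability estimators against logistic regression. As a minimal sketch of that kind of comparison, assuming simulated dichotomous data rather than the study's datasets, the following R snippet estimates class-membership probabilities with a random forest and with logistic regression and compares them by Brier score; all names, settings, and the choice of score are illustrative assumptions, not the paper's code.

## A minimal sketch (not the study's code): random forest vs. logistic
## regression for probability estimation on a simulated dichotomous outcome.
library(randomForest)

set.seed(1)
n <- 500; p <- 10
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("x", seq_len(p))
true_prob <- plogis(x[, 1] - 0.5 * x[, 2])   # true class-1 probability
y <- factor(rbinom(n, 1, true_prob))
dat <- data.frame(y, x)

## Random forest: the estimated probability is the fraction of trees voting
## for each class; with no newdata, predict() returns out-of-bag estimates.
rf <- randomForest(y ~ ., data = dat, ntree = 500)
p_rf <- predict(rf, type = "prob")[, "1"]

## Logistic regression as the parametric reference.
lr <- glm(y ~ ., data = dat, family = binomial)
p_lr <- fitted(lr)

## Brier score: mean squared distance between estimate and observed outcome.
brier <- function(p, y) mean((p - (y == "1"))^2)
c(randomForest = brier(p_rf, y), logistic = brier(p_lr, y))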

Cited by 49 publications (69 citation statements)
References 50 publications
“…First, the R packages randomForest (Liaw and Wiener 2002), randomForestSRC (Ishwaran and Kogalur 2015) and Rborist (Seligman 2015), the C++ application Random Jungle (Schwarz et al 2010; Kruppa et al 2014b), and the R version of the new implementation ranger were run with small simulated datasets, a varying number of features p, sample size n, number of features tried for splitting (mtry), and a varying number of trees grown in the RF. In each case, the other three parameters were kept fixed to 500 trees, 1,000 samples, 1,000 features and mtry = √p. The datasets mimic genetic data, consisting of p single nucleotide polymorphisms (SNPs) measured on n subjects.…”
Section: Runtime and Memory Usage
confidence: 99%
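As a rough illustration of the benchmark design quoted above, the following R sketch simulates SNP-like genotype data at the stated defaults (n = 1,000 subjects, p = 1,000 features, 500 trees, mtry = √p) and times ranger against randomForest. It is a minimal sketch under assumed data-generating settings, not the cited benchmark code, and it times only one configuration per package rather than varying each parameter in turn.

## A minimal sketch, not the cited benchmark code: SNP-like data
## (genotypes coded 0/1/2) at the quoted defaults.
library(ranger)
library(randomForest)

set.seed(42)
n <- 1000; p <- 1000
x <- matrix(sample(0:2, n * p, replace = TRUE), n, p)  # simulated SNP genotypes
colnames(x) <- paste0("snp", seq_len(p))
y <- factor(rbinom(n, 1, 0.5))
dat <- data.frame(y, x)
mtry <- floor(sqrt(p))

## Time one fixed configuration per package; the full benchmark described
## above varies n, p, mtry and the number of trees one at a time.
t_ranger <- system.time(ranger(y ~ ., data = dat, num.trees = 500, mtry = mtry))
t_rf     <- system.time(randomForest(y ~ ., data = dat, ntree = 500, mtry = mtry))
rbind(ranger = t_ranger, randomForest = t_rf)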
“…This package is studied in greater detail in Section 5. Finally, an RF implementation optimized for analyzing high-dimensional data is Random Jungle (Schwarz et al 2010; Kruppa et al 2014b). This package is only available as a C++ application with library dependencies, and it is not portable to R or another statistical programming language.…”
Section: Introduction
confidence: 99%
“…It is usually required by surgeons, oncologists, pathologists, pediatricians, and professionals in internal medicine and human genetics (Malley et al (2012)). For instance, carrier probabilities are calculated in genetic counseling, and the probability of treatment response is estimated in personalized medicine for each patient (Kruppa et al (2014b)).…”
Section: Introduction
confidence: 99%
“…For example, in safety-critical domains such as surgery, oncology, internal medicine, pathology, paediatrics and human genetics, these probabilities are needed. In all the aforementioned areas, probability estimates are more useful than simple classification, as they provide a measure of the reliability of the decision taken about an individual (Lee et al (2010), Malley et al (2012), Kruppa et al (2012), Kruppa et al (2014a, 2014b)). Machine learning techniques used mainly for classification can be used as non-parametric methods for class membership probability estimation, in order to avoid the assumptions imposed by the parametric models used to estimate these probabilities (Malley et al (2012)).…”
Section: Introduction
confidence: 99%
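To make the quoted distinction concrete, the following R sketch contrasts hard classification with class-membership probability estimation for a multicategory outcome using ranger's probability forests, which implement the probability estimation trees of Malley et al (2012); the iris data and all settings are illustrative assumptions, not those of the cited studies.

## A minimal sketch of probability estimation vs. plain classification for a
## multicategory outcome; the iris data and settings are illustrative only.
library(ranger)

set.seed(7)
fit_class <- ranger(Species ~ ., data = iris)                      # hard classification
fit_prob  <- ranger(Species ~ ., data = iris, probability = TRUE)  # probability forest

new_obs <- iris[c(1, 51, 101), ]   # one observation per species

predict(fit_class, data = new_obs)$predictions  # class labels only
predict(fit_prob,  data = new_obs)$predictions  # per-class probabilities: a
                                                # measure of reliability behind
                                                # each predicted label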