2016
DOI: 10.1007/s11634-016-0276-4

A computationally fast variable importance test for random forests for high-dimensional data

Abstract: Random forests are a commonly used tool for classification with high-dimensional data, as well as for ranking candidate predictors based on so-called variable importance measures. There are different importance measures for ranking predictor variables; the two most common are the Gini importance and the permutation importance. The latter has been found to be more reliable than the Gini importance. It is computed from the change in prediction accuracy when removing any association between the respon…
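The permutation importance mentioned in the abstract can be sketched as follows. This is a simplified in-sample illustration (the measure discussed in the paper is normally computed on out-of-bag samples); the synthetic dataset and forest parameters are illustrative assumptions, not taken from the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Fit a forest and record baseline accuracy.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
baseline = rf.score(X, y)

# For each predictor, shuffle its values to destroy any association with
# the response, then measure the resulting drop in accuracy.
importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(baseline - rf.score(X_perm, y))
```

Predictors whose shuffling barely changes accuracy receive importance near zero (or below, due to randomness), which is what motivates the testing procedures discussed in the citing works below.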

Cited by 152 publications (85 citation statements)
References 37 publications
“…The permutation variable importance measure used herein quantifies the loss in skill, the algorithm's ability to predict composite score based on structural indicator variables, by randomly permuting the values of a single predictor variable and comparing that to the unpermuted version. We use this metric to filter structural indicator variables not useful in understanding how structure impacts composite skill, i.e., those with negative values are excluded from further analysis (Janitza et al 2018). Using variable importance scores as a filter and then calculating gain for topmost splitting variables allows us to identify which structural attributes are most relevant to divergence and to quantify this dependence.…”
Section: Simulation Results
confidence: 99%
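The filtering step this excerpt describes, excluding predictors with negative permutation importance, can be sketched in a few lines. The variable names and scores here are hypothetical, chosen only to illustrate the rule.

```python
# Hypothetical permutation importance scores for structural indicator
# variables; negative scores suggest the predictor carries no signal.
importances = {"height": 0.12, "density": -0.03, "cover": 0.07, "age": -0.01}

# Keep only predictors with non-negative importance, as in the excerpt.
kept = [name for name, imp in importances.items() if imp >= 0]
```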
“…In order to finally determine the relevant variables based on their permutation importance, several variable selection strategies have been proposed. In general, the Boruta method 43 and the Vita algorithm 44 can be recommended, as they have been shown to be well balanced in terms of sensitivity and specificity. 45 In the real-world data application, we are able to clarify which of the variables brought up by the original MOB might be truly predictive.…”
Section: Discussion
confidence: 99%
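The Boruta method named in this excerpt can be sketched in simplified form: augment the data with "shadow" copies of each predictor whose values are independently permuted, fit a forest, and keep predictors whose importance beats the best shadow. This is an assumed, one-shot simplification (real Boruta iterates with statistical testing), and the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=1)

# Shadow features: each column permuted independently, destroying any
# association with the response while preserving marginal distributions.
shadows = rng.permuted(X, axis=0)
X_aug = np.hstack([X, shadows])

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_aug, y)
imp = rf.feature_importances_
real, shadow = imp[:8], imp[8:]

# Keep predictors whose importance exceeds the best-performing shadow.
selected = np.where(real > shadow.max())[0]
```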
“…Random forests are a machine learning technique which can be used to find the variables – here proteins – that allow one to predict which datasets or samples are similar (and which are not; Degenhardt et al, 2019). For variable importance calculation, we employed the method of Janitza et al (2018) as implemented in the ranger package. This method uses a heuristic approach in which a null distribution for p-value calculation is generated from variables with zero or negative importance scores.…”
Section: Methods
confidence: 99%
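The heuristic this excerpt describes can be sketched as follows: the non-positive importance scores are taken as draws from the null, the negative ones are mirrored to the positive side to form a symmetric null distribution, and each predictor's p-value is the fraction of null values at least as large as its score. This is a simplified reading of the quoted description, not the ranger implementation, and the scores are made-up numbers.

```python
import numpy as np

# Hypothetical permutation importance scores for eight predictors.
imp = np.array([0.9, 0.02, -0.05, 0.0, -0.02, 0.5, -0.01, 0.03])

# Null distribution: the non-positive scores plus mirrored copies of
# the negative ones.
nonpos = imp[imp <= 0]
null = np.concatenate([nonpos, -nonpos[nonpos < 0]])

# One-sided p-value: fraction of null values >= the observed score.
pvals = np.array([(null >= v).mean() for v in imp])
```

Predictors with clearly positive importance (like the 0.9 score above) receive a p-value of zero under this finite null, while scores within the noise range receive large p-values.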