2018
DOI: 10.1101/467498
Preprint

Refining interaction search through signed iterative Random Forests

Abstract: Advances in supervised learning have enabled accurate prediction in biological systems governed by complex interactions among biomolecules. However, state-of-the-art predictive algorithms are typically "black-boxes," learning statistical interactions that are difficult to translate into testable hypotheses. The iterative Random Forest (iRF) algorithm took a step towards bridging this gap by providing a computationally tractable procedure to identify the stable, high-order feature interactions that drive the pr…

Cited by 17 publications (34 citation statements)
References 43 publications
“…Interactions are important as ML models are often highly nonlinear and learn complex interactions between features. Methods exist to extract interactions from many ML models, including random forests (21,57,58) and neural networks (59,60). In the below example, the descriptive accuracy of random forests is increased by extracting Boolean interactions (a problem-relevant form of interpretation) from a trained model.…”
Section: Post Hoc Interpretability
Mentioning (confidence: 99%)
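As a rough illustration of what "extracting interactions from a trained model" can look like in practice, the sketch below walks the decision paths of a fitted scikit-learn random forest and counts which feature pairs co-occur on the same root-to-leaf path. This is only a minimal proxy for the Boolean-interaction search described in the quoted excerpt, not the cited authors' implementation; the synthetic dataset (make_classification) and all parameter choices are illustrative assumptions.

```python
# Minimal sketch (not the iRF implementation): count feature pairs that
# co-occur along root-to-leaf decision paths of a fitted random forest.
# Data and parameters below are synthetic/illustrative assumptions.
from collections import Counter
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

pair_counts = Counter()
for est in rf.estimators_:
    tree = est.tree_
    # Walk every root-to-leaf path, recording which features were split on.
    stack = [(0, set())]  # (node_id, features seen so far on this path)
    while stack:
        node, feats = stack.pop()
        if tree.children_left[node] == -1:           # leaf node: record pairs
            for pair in combinations(sorted(feats), 2):
                pair_counts[pair] += 1
            continue
        feats = feats | {tree.feature[node]}
        stack.append((tree.children_left[node], feats))
        stack.append((tree.children_right[node], feats))

# Feature pairs that appear together on many paths are candidate order-2
# interactions; larger combinations extend the same idea to higher orders.
print(pair_counts.most_common(5))
```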
“…A previous line of work considers the problem of searching for biological interactions associated with important biological processes (21,57). To identify candidate biological interactions, the authors train a series of iteratively reweighted random forests (RFs) and search for stable combinations of features that frequently co-occur along the predictive RF decision paths.…”
Section: Post Hoc Interpretability
Mentioning (confidence: 99%)
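The excerpt above names two ingredients: candidate interactions read off the decision paths, and a stability filter that keeps only combinations recurring across repeated fits. The sketch below illustrates the stability filter only, under stated assumptions: co-occurrence is proxied by features used within the same tree (a simplification of the per-path search), the outer perturbation is a plain bootstrap rather than iterative reweighting, and the data are synthetic.

```python
# Hedged sketch of the stability filter: refit the forest on bootstrap
# resamples and keep feature combinations that recur as top candidates.
from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

def tree_pair_counts(rf):
    """Count feature pairs used within the same tree of a fitted forest."""
    counts = Counter()
    for est in rf.estimators_:
        used = sorted({f for f in est.tree_.feature if f >= 0})
        counts.update(combinations(used, 2))
    return counts

n_boot, top_k = 20, 5
appearances = Counter()
for _ in range(n_boot):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample
    rf = RandomForestClassifier(n_estimators=50).fit(X[idx], y[idx])
    for pair, _ in tree_pair_counts(rf).most_common(top_k):
        appearances[pair] += 1

# Stability score: fraction of bootstrap fits in which a pair was a top candidate.
stability = {pair: n / n_boot for pair, n in appearances.items()}
print(sorted(stability.items(), key=lambda kv: -kv[1])[:5])
```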
“…Dropout in neural networks is a form of algorithm perturbation that leverages stability to improve generalizability (43). Our previous work (34) stabilizes random forests to interpret decision rules in tree ensembles (34,44), which are perturbed using random feature selection (model perturbation) and bootstrap (data perturbation).…”
Section: D2 Data Perturbation
Mentioning (confidence: 99%)
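For concreteness, the following sketch shows the two perturbations named in the excerpt as they surface in scikit-learn's random forest: bootstrap=True resamples the data for each tree (data perturbation) and max_features randomly restricts the features considered at each split (model perturbation). The final stability check on top-ranked features is an illustrative assumption, not the cited authors' procedure.

```python
# Hedged sketch: data perturbation (bootstrap) and model perturbation
# (random feature selection), with a simple stability check on top features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

def top_features(seed, k=3):
    rf = RandomForestClassifier(
        n_estimators=100,
        bootstrap=True,        # data perturbation: each tree sees a bootstrap sample
        max_features="sqrt",   # model perturbation: random feature subset per split
        random_state=seed,
    ).fit(X, y)
    return set(np.argsort(rf.feature_importances_)[-k:])

runs = [top_features(seed) for seed in range(20)]
# A feature's stability = fraction of perturbed fits in which it ranks top-k.
stability = {f: np.mean([f in r for r in runs]) for f in set().union(*runs)}
print(dict(sorted(stability.items(), key=lambda kv: -kv[1])))
```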
“…The iterative Random Forest algorithm (iRF), and corresponding iRF R package, take a step towards addressing these issues with a computationally tractable approach to search for important interactions in a fitted random forest (Kumbier, Basu, Brown, Celniker, & Yu, 2018). Our algorithm grows a series of feature-weighted random forests (Amaratunga, Cabrera, & Lee, 2008) to perform soft regularization on the model based on predictive features.…”
Section: Discussion
Mentioning (confidence: 99%)
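The iterative reweighting loop described in this excerpt can be sketched roughly as follows. The iRF R package weights the per-node feature sampling directly; scikit-learn exposes no such option, so the sketch approximates soft regularization by drawing a feature subset at each iteration with probability proportional to the previous iteration's Gini importances. All data, parameters, and the subsetting heuristic are illustrative assumptions, not the package's behavior.

```python
# Hedged approximation of iterative feature reweighting with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

n_features = X.shape[1]
weights = np.full(n_features, 1.0 / n_features)   # iteration 0: uniform weights

for it in range(4):
    # Draw a feature subset in proportion to current weights (a stand-in for
    # feature-weighted split sampling), then fit a forest on that subset.
    subset = rng.choice(n_features, size=max(2, n_features // 2),
                        replace=False, p=weights)
    rf = RandomForestClassifier(n_estimators=100, random_state=it)
    rf.fit(X[:, subset], y)

    # Soft regularization: importances from this fit become the next
    # iteration's sampling weights, concentrating on predictive features.
    new_weights = np.full(n_features, 1e-6)         # keep every feature reachable
    new_weights[subset] += rf.feature_importances_
    weights = new_weights / new_weights.sum()

print(np.round(weights, 3))   # mass should concentrate on informative features
```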