Tree-based machine learning models such as random forests, decision trees, and gradient boosted trees are popular non-linear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here, we improve the interpretability of tree-based models through three main contributions: 1) The first polynomial time algorithm to compute optimal explanations based on game theory. 2) A new type of explanation that directly measures local feature interaction effects. 3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to i) identify high magnitude but low frequency non-linear mortality risk factors in the US population, ii) highlight distinct population subgroups with shared risk characteristics, iii) identify non-linear interaction effects among risk factors for chronic kidney disease, and iv) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model's performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains.
Elevated serum urate levels cause gout, and correlate with cardio-metabolic diseases via poorly understood mechanisms. We performed a trans-ethnic genome-wide association study of serum urate among 457,690 individuals, identifying 183 loci (147 novel) that improve prediction of gout in an independent cohort of 334,880 individuals. Serum urate showed significant genetic correlations with many cardio-metabolic traits, with genetic causality analyses supporting a substantial role for pleiotropy. Enrichment analysis, fine-mapping of urateassociated loci and co-localization with gene expression in 47 tissues implicated kidney and liver as main target organs and prioritized potentially causal genes and variants, including the transcriptional master regulators in liver and kidney, HNF1A and HNF4A. Experimental validation showed that HNF4A trans-activated the promoter of the major urate transporter ABCG2 in kidney cells, and that HNF4A p.Thr139Ile is a functional variant. Transcriptional coregulation within and across organs may be a general mechanism underlying the observed pleiotropy between urate and cardio-metabolic traits.
Increased levels of the urinary albumin-to-creatinine ratio (UACR) are associated with higher risk of kidney disease progression and cardiovascular events, but underlying mechanisms are incompletely understood. Here, we conduct trans-ethnic (n = 564,257) and European-ancestry specific meta-analyses of genome-wide association studies of UACR, including ancestry- and diabetes-specific analyses, and identify 68 UACR-associated loci. Genetic correlation analyses and risk score associations in an independent electronic medical records database (n = 192,868) reveal connections with proteinuria, hyperlipidemia, gout, and hypertension. Fine-mapping and trans-Omics analyses with gene expression in 47 tissues and plasma protein levels implicate genes potentially operating through differential expression in kidney (including TGFB1, MUC1, PRKCI, and OAF), and allow coupling of UACR associations to altered plasma OAF concentrations. Knockdown of OAF and PRKCI orthologs in Drosophila nephrocytes reduces albumin endocytosis. Silencing fly PRKCI further impairs slit diaphragm formation. These results generate a priority list of genes and pathways for translational research to reduce albuminuria.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.