Causal discovery methods provide a means to ascertain causal attribution from observational data. Causal modeling at scale requires a method to populate models with relevant domain knowledge. We propose to use the biomedical literature to perform feature selection for drug/adverse drug event (ADE) models with clinical observational data derived from electronic health records (EHR) as our primary input data source. We reason that spurious (non-causal) drug-ADE associations from co-occurrence-based analyses should diminish conditional on sets of validated confounders identified in the literature. To evaluate this hypothesis, we used a publicly available reference data set to test the proposed methodology with 4 ADEs and 399 drug-ADE pairs. We calculated baseline scores using the rank order of the regression coefficients for each drug-ADE pair. We then identified confounding variable candidates for each drug-ADE pair using relationship constraints based on normalized predicates to search knowledge extracted from the literature in the publicly available SemMedDB repository. To determine eligibility for inclusion, we checked whether there were directed edges pointing to both the drug and the ADE. Finally, we tested whether associations from co-occurrence in the clinical data are diminished conditional on sets of permutations of confounders identified in the literature. Confounder yield rate was ~90%, indicating that our method successfully identified confounders in the observational data. Causal models attained aggregate performance improvements of ~0.07 area under the curve and reduced the False Discovery Rate from 0.50 to 0.38 over purely statistical models using unadjusted logistic regression.
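The eligibility check described above (a candidate needs directed edges pointing to both the drug and the ADE) can be sketched in a few lines. This is only an illustration under stated assumptions: the triples, the concept names (`drug_x`, `ade_y`, `condition_a`, and so on), the predicate set, and the function `confounder_candidates` are all hypothetical and are not taken from SemMedDB or from the authors' implementation.

```python
# Toy literature-derived triples: (subject, normalized predicate, object).
# Every concept name and triple here is a made-up placeholder, not SemMedDB content.
TRIPLES = [
    ("condition_a", "CAUSES", "ade_y"),
    ("condition_a", "PREDISPOSES", "drug_x"),
    ("drug_x", "CAUSES", "ade_y"),
    ("ade_y", "CAUSES", "condition_b"),
    ("condition_c", "CAUSES", "ade_y"),
    ("drug_x", "TREATS", "condition_d"),
]

# Assumed set of predicates treated as directed causal edges.
CAUSAL_PREDICATES = {"CAUSES", "PREDISPOSES"}


def confounder_candidates(triples, drug, ade, predicates=CAUSAL_PREDICATES):
    """Return concepts that have a directed causal edge into both drug and ade."""
    into_drug = {s for s, p, o in triples if p in predicates and o == drug}
    into_ade = {s for s, p, o in triples if p in predicates and o == ade}
    # Exclude the drug and the ADE themselves from the candidate set.
    return sorted((into_drug & into_ade) - {drug, ade})


candidates = confounder_candidates(TRIPLES, "drug_x", "ade_y")
print(candidates)
```

In this toy graph only `condition_a` points to both the drug and the ADE; `condition_c` points to the ADE alone and is therefore not eligible.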
Introduction: Causal feature selection entails identifying confounders that eliminate confounding bias when estimating effects from observational data. Traditionally, researchers employ expertise and literature review to identify confounders. Uncontrolled confounding from unidentified confounders threatens validity while conditioning on intermediate variables (mediators) weakens estimates, and conditioning on common effects (colliders) induces bias. Additionally, erroneously conditioning on variables playing multiple roles introduces bias. In a use case studying depression as a potential independent risk factor for Alzheimer’s disease (AD), we introduce a novel knowledge graph application enabling causal feature selection from computable literature-derived knowledge and biomedical ontologies to address these challenges.
Methods: Using the output from three machine reading systems, we harmonized the computable knowledge extracted from a scoped literature corpus. Next, we applied logical closure operations to infer missing knowledge and mapped the outputs to target terminologies. We then combined the outputs with ontology-grounded resources using a robust KG framework developed by computational biologists. Next, we translated epidemiological definitions of confounder, collider, and mediator into queries for searching the KG and summarized the roles played by the variables identified. Finally, we analyzed a selection of variables and reasoning paths in the search results.
Results: Confounder search yielded 128 confounders, including 58 phenotypes, 47 drugs, and 35 genes. Search also identified 23 collider and 16 mediator phenotypes. Only 31 of the 58 confounder phenotypes were found to behave exclusively as confounders. The remaining 27 phenotypes also play other roles, and 7 of the 21 confounders identified by both the KG and the literature were identified as being exclusively confounders. Stroke was an example of a variable playing all three roles.
Discussion: Our findings suggest that our KG application could augment human expertise while confirming the complexity of selecting potential confounders for depression with AD. Imperfect concept mapping introduced errors, and the small literature corpus limited the scope of search results.
Conclusion: Our results suggest that our method may widely apply to causal feature selection. However, the search results need to be reviewed by human experts and tested empirically, and further work is required to optimize KG output for human consumption.
Highlights
• Knowledge of causal variables and their roles is essential for causal inference.
• We show how to search a knowledge graph (KG) for causal variables and their roles.
• The KG combines literature-derived knowledge with ontology-grounded knowledge.
• We design queries to search the KG for confounder, collider, and mediator roles.
• KG search reveals variables in these roles for depression and Alzheimer’s disease.
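To make the three role definitions concrete, the sketch below classifies variables in a small directed graph by their relation to an exposure and an outcome. It is a simplified illustration, not the authors' KG queries: it assumes roles are decided by single direct edges (the abstract mentions reasoning paths, which may be longer), and the edge list, the placeholder variables `var_a` and `var_b`, and the function `classify_roles` are hypothetical. Stroke is given edges consistent with the abstract's statement that it plays all three roles.

```python
# Illustrative directed edges (cause, effect). Only the claim that stroke plays
# all three roles comes from the abstract; the edges themselves are assumptions.
EDGES = [
    ("stroke", "depression"),
    ("stroke", "alzheimers_disease"),
    ("depression", "stroke"),
    ("alzheimers_disease", "stroke"),
    ("var_a", "depression"),
    ("var_a", "alzheimers_disease"),
    ("depression", "var_b"),
]


def classify_roles(edges, exposure, outcome):
    """Map each variable to the set of roles it plays for (exposure, outcome)."""
    edge_set = set(edges)
    nodes = {n for edge in edge_set for n in edge} - {exposure, outcome}
    roles = {}
    for v in sorted(nodes):
        found = set()
        # Confounder: common cause of exposure and outcome.
        if (v, exposure) in edge_set and (v, outcome) in edge_set:
            found.add("confounder")
        # Collider: common effect of exposure and outcome.
        if (exposure, v) in edge_set and (outcome, v) in edge_set:
            found.add("collider")
        # Mediator: lies on the path exposure -> v -> outcome.
        if (exposure, v) in edge_set and (v, outcome) in edge_set:
            found.add("mediator")
        if found:
            roles[v] = found
    return roles


roles = classify_roles(EDGES, "depression", "alzheimers_disease")
for name, found in roles.items():
    print(name, sorted(found))
```

A variable such as `var_a` that appears only as a confounder would be a safer adjustment candidate than one that, like stroke here, also appears as a collider or mediator.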
We introduce an approach to causal modeling that uses Literature-Based Discovery (LBD) to identify salient domain knowledge in observational data. Causal models represent a marriage between graph theory, probability, and domain knowledge. We hypothesize that the LBD paradigm can be applied to identify variables of interest for the automated construction of causal models of observational data, and that causal models thus informed will improve upon the performance of purely statistical techniques. We evaluated our hypothesis with a pharmacovigilance (PV) use case. In PV, the task is to discriminate between drug/side-effect signals and noise. We analyzed observational clinical data derived from electronic health records (EHR) and constructed causal models. We used logistic regression coefficients as our baseline and calculated estimated controlled direct effect from the LBD-informed causal models. Causal models improved upon unadjusted statistical models by 8.64% using Area under the Curve of the Receiver Operating Characteristic. Improving upon previous work in PV using EHR as the primary data source, our results establish the utility of the approach.
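The comparison metric named above, Area under the Curve of the Receiver Operating Characteristic, can be computed directly from its rank interpretation: the probability that a randomly chosen true drug/side-effect pair scores higher than a randomly chosen negative pair. The sketch below uses that pairwise definition; the labels and scores are invented toy values, the function `roc_auc` is a hypothetical helper, and nothing here reproduces the paper's data or results.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy reference labels: 1 = true drug/side-effect pair, 0 = negative control.
labels = [1, 1, 0, 0]
# Invented scores standing in for a baseline and an adjusted model.
baseline_scores = [0.9, 0.4, 0.6, 0.2]
adjusted_scores = [0.9, 0.7, 0.6, 0.2]

baseline_auc = roc_auc(labels, baseline_scores)
adjusted_auc = roc_auc(labels, adjusted_scores)
print(baseline_auc, adjusted_auc)
```

The pairwise form is quadratic in the number of pairs, which is acceptable for a reference set of a few hundred drug/side-effect pairs; a sort-based implementation would be preferable for larger sets.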