Objective: Electronic health records (EHRs) are a rich source of information on human diseases, but the information is variably structured, fragmented, curated using different coding systems, and collected for purposes other than medical research. We describe an approach for developing, validating, and sharing reproducible phenotypes from national structured EHR in the United Kingdom with applications for translational research. Materials and Methods: We implemented a rule-based phenotyping framework, with up to 6 approaches of validation. We applied our framework to a sample of 15 million individuals in a national EHR data source (population-based primary care, all ages) linked to hospitalization and death records in England. Data comprised continuous measurements (for example, blood pressure), medication information, and coded diagnoses, symptoms, procedures, and referrals, recorded using 5 controlled clinical terminologies: (1) Read (primary care, subset of SNOMED-CT [Systematized Nomenclature of Medicine Clinical Terms]), (2) International Classification of Diseases–Ninth Revision and Tenth Revision (secondary care diagnoses and cause of mortality), (3) Office of Population Censuses and Surveys Classification of Surgical Operations and Procedures, Fourth Revision (hospital surgical procedures), and (4) DM+D prescription codes. Results: Using the CALIBER phenotyping framework, we created algorithms for 51 diseases, syndromes, biomarkers, and lifestyle risk factors and provide up to 6 validation approaches. The EHR phenotypes are curated in the open-access CALIBER Portal (https://www.caliberresearch.org/portal) and have been used by 40 national and international research groups in 60 peer-reviewed publications. Conclusions: We describe a UK EHR phenomics approach within the CALIBER EHR data platform with initial evidence of validity and use, as an important step toward international use of UK EHR data for health research.
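To make the idea of a rule-based phenotype concrete, the following is a minimal sketch, not the CALIBER implementation. It assumes a phenotype is defined by one code list per terminology and that a patient becomes a case at the earliest matching record across linked sources. The `Event` structure, the `PHENOTYPE_CODES` mapping, and the Read code string are hypothetical placeholders; only the ICD-10 categories I21 and I22 (myocardial infarction) are real codes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """One coded record from a linked data source (hypothetical structure)."""
    patient_id: str
    source: str       # e.g. "primary_care", "hospital", "death"
    terminology: str  # e.g. "read", "icd10"
    code: str
    when: date


# Illustrative code lists only; the Read entry is a placeholder, not a real code.
PHENOTYPE_CODES = {
    "read": {"READ_PLACEHOLDER_MI"},
    "icd10": {"I21", "I22"},
}


def matches(event: Event) -> bool:
    """True if the event's code falls under the phenotype's list for its terminology."""
    codes = PHENOTYPE_CODES.get(event.terminology, set())
    # Prefix matching lets a category such as I21 also capture I21.0, I21.1, ...
    return any(event.code.startswith(code) for code in codes)


def first_diagnosis(events):
    """Map each case to the (date, source) of the earliest matching record."""
    first = {}
    for event in events:
        if not matches(event):
            continue
        current = first.get(event.patient_id)
        if current is None or event.when < current[0]:
            first[event.patient_id] = (event.when, event.source)
    return first


events = [
    Event("p1", "primary_care", "read", "READ_PLACEHOLDER_MI", date(2010, 5, 1)),
    Event("p1", "hospital", "icd10", "I21.0", date(2010, 4, 28)),
    Event("p2", "hospital", "icd10", "J45", date(2012, 1, 3)),  # asthma, not a match
]
cases = first_diagnosis(events)
print(cases)
```

Keeping the code lists as plain data, separate from the matching rule, is one way such a definition could be shared and reviewed independently of the code that applies it.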
Aims: While most patients with myocardial infarction (MI) have underlying coronary atherosclerosis, not all patients with coronary artery disease (CAD) develop MI. We sought to address the hypothesis that some of the genetic factors which establish atherosclerosis may be distinct from those that predispose to vulnerable plaques and thrombus formation. Methods and results: We carried out a genome-wide association study for MI in the UK Biobank (n∼472 000), followed by a meta-analysis with summary statistics from the CARDIoGRAMplusC4D Consortium (n∼167 000). Multiple independent replication analyses and functional approaches were used to prioritize loci and evaluate positional candidate genes. Eight novel regions were identified for MI at the genome-wide significance level, of which effect sizes at six loci were more robust for MI than for CAD without the presence of MI. Confirmatory evidence for association of a locus on chromosome 1p21.3 harbouring choline-like transporter 3 (SLC44A3) with MI in the context of CAD, but not with coronary atherosclerosis itself, was obtained in Biobank Japan (n∼165 000) and 16 independent angiography-based cohorts (n∼27 000). Follow-up analyses did not reveal association of the SLC44A3 locus with CAD risk factors, biomarkers of coagulation, other thrombotic diseases, or plasma levels of a broad array of metabolites, including choline, trimethylamine N-oxide, and betaine. However, aortic expression of SLC44A3 was increased in carriers of the MI risk allele at chromosome 1p21.3, increased in ischaemic (vs. non-diseased) coronary arteries, up-regulated in human aortic endothelial cells treated with interleukin-1β (vs. vehicle), and associated with smooth muscle cell migration in vitro. Conclusions: A large-scale analysis comprising ∼831 000 subjects revealed novel genetic determinants of MI and implicated SLC44A3 in the pathophysiology of vulnerable plaques.
Mendelian randomization (MR) is increasingly used to make causal inferences in a wide range of fields, from drug development to etiologic studies. Causal inference in MR is possible because of the process of genetic inheritance from parents to offspring. Specifically, at gamete formation and conception, meiosis ensures random allocation to the offspring of one allele from each parent at each locus, and these are unrelated to most of the other inherited genetic variants. To date, most MR studies have used data from unrelated individuals. These studies assume that genotypes are independent of the environment across a sample of unrelated individuals, conditional on covariates. Here we describe potential sources of bias, such as transmission ratio distortion, selection bias, population stratification, dynastic effects and assortative mating that can induce spurious or biased SNP–phenotype associations. We explain how studies of related individuals such as sibling pairs or parent–offspring trios can be used to overcome some of these sources of bias, to provide potentially more reliable evidence regarding causal processes. The increasing availability of data from related individuals in large cohort studies presents an opportunity to both overcome some of these biases and also to evaluate familial environmental effects.
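For readers unfamiliar with how an MR estimate is formed from SNP–phenotype associations, here is a minimal sketch of two standard summary-data estimators: the per-SNP Wald ratio and an inverse-variance weighted (IVW) combination. The abstract does not say which estimators the authors use, so this is only a generic illustration. The function names and all numeric inputs are invented, and the standard error uses a first-order approximation that ignores uncertainty in the SNP–exposure association.

```python
import math


def wald_ratio(beta_gx, beta_gy, se_gy):
    """Per-SNP causal estimate: SNP-outcome effect divided by SNP-exposure effect.

    The standard error is a first-order approximation that treats beta_gx as known.
    """
    estimate = beta_gy / beta_gx
    se = se_gy / abs(beta_gx)
    return estimate, se


def ivw(estimates, ses):
    """Fixed-effect inverse-variance weighted average of per-SNP estimates."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Invented summary statistics: (SNP-exposure beta, SNP-outcome beta, SNP-outcome SE).
snps = [
    (0.10, 0.020, 0.005),
    (0.20, 0.050, 0.010),
    (0.05, 0.010, 0.004),
]

ratios = [wald_ratio(*snp) for snp in snps]
pooled, pooled_se = ivw([r[0] for r in ratios], [r[1] for r in ratios])
print(f"IVW estimate {pooled:.3f} (SE {pooled_se:.3f})")
```

If the SNP–outcome or SNP–exposure associations are distorted by stratification, dynastic effects, or assortative mating, that distortion passes straight through the ratio. This is why family-based estimates of those associations matter.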
Replicable genetic association signals have consistently been found through genome-wide association studies in recent years. The recent dramatic expansion of study sizes improves power of estimation of effect sizes, genomic prediction, causal inference, and polygenic selection, but it simultaneously increases susceptibility of these methods to bias due to subtle population structure. Standard methods using genetic principal components to correct for structure might not always be appropriate and we use a simulation study to illustrate when correction might be ineffective for avoiding biases. New methods such as trans-ethnic modeling and chromosome painting allow for a richer understanding of the relationship between traits and population structure. We illustrate the arguments using real examples (stroke and educational attainment) and provide a more nuanced understanding of population structure, which is set to be revisited as a critical aspect of future analyses in genetic epidemiology. We also make simple recommendations for how problems can be avoided in the future. Our results have particular importance for the implementation of GWAS meta-analysis, for prediction of traits, and for causal inference.
Estimates from genome-wide association studies (GWAS) represent a combination of the effect of inherited genetic variation (direct effects), demography (population stratification, assortative mating) and genetic nurture from relatives (indirect genetic effects). GWAS using family-based designs can control for demography and indirect genetic effects, but large-scale family datasets have been lacking. We combined data on 159,701 siblings from 17 cohorts to generate population (between-family) and within-sibship (within-family) estimates of genome-wide genetic associations for 25 phenotypes. We demonstrate that existing GWAS associations for height, educational attainment, smoking, depressive symptoms, age at first birth and cognitive ability overestimate direct effects. We show that estimates of SNP-heritability, genetic correlations and Mendelian randomization involving these phenotypes substantially differ when calculated using within-sibship estimates. For example, genetic correlations between educational attainment and height largely disappear. In contrast, analyses of most clinical phenotypes (e.g. LDL-cholesterol) were generally consistent between population and within-sibship models. We also report compelling evidence of polygenic adaptation on taller human height using within-sibship data. Large-scale family datasets provide new opportunities to quantify direct effects of genetic variation on human traits and diseases.
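The contrast between population and within-sibship estimates can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not the study's analysis pipeline. The genotype is a continuous score rather than an allele count, each family has exactly two siblings, and a single shared family term stands in for demography and indirect genetic effects. The effect sizes and the names `DIRECT_EFFECT` and `FAMILY_EFFECT` are invented for illustration.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


DIRECT_EFFECT = 0.3   # assumed causal effect of the individual's own genotype
FAMILY_EFFECT = 0.5   # assumed effect of the shared family background
N_FAMILIES = 5000

rng = random.Random(0)
genotype, centred_genotype, phenotype = [], [], []
for _ in range(N_FAMILIES):
    # Shared family term: shifts both siblings' genotypes and their phenotypes.
    family = rng.gauss(0, 1)
    # Sibling genotypes differ only by a random segregation term.
    sib_genotypes = [family + rng.gauss(0, 1) for _ in range(2)]
    family_mean = sum(sib_genotypes) / 2
    for g in sib_genotypes:
        genotype.append(g)
        centred_genotype.append(g - family_mean)
        phenotype.append(DIRECT_EFFECT * g + FAMILY_EFFECT * family + rng.gauss(0, 1))

# Population (between-family) estimate: absorbs the family-level confounding.
population_estimate = ols_slope(genotype, phenotype)
# Within-sibship estimate: uses only each sibling's deviation from the family mean.
within_sibship_estimate = ols_slope(centred_genotype, phenotype)

print(f"population:     {population_estimate:.3f}")
print(f"within-sibship: {within_sibship_estimate:.3f}")
```

In this setup the population slope is expected to be about 0.55, because cov(g, y) / var(g) = (0.3 × 2 + 0.5 × 1) / 2. Centring the genotype on the family mean removes everything the siblings share, so the within-sibship slope should land near the assumed direct effect of 0.3.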