This is a PDF file of a peer-reviewed paper that has been accepted for publication. Although unedited, the content has been subjected to preliminary formatting. Nature is providing this early version of the typeset paper as a service to our authors and readers. The text and figures will undergo copyediting and a proof review before the paper is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers apply.
The UK Biobank is a prospective study of 502,543 individuals, combining extensive phenotypic and genotypic data with streamlined access for researchers around the world (1). Here we describe the release of exome-sequence data for the first 49,960 study participants, revealing approximately 4 million coding variants (of which around 98.6% have a frequency of less than 1%). The data include 198,269 autosomal predicted loss-of-function (LOF) variants, a more than 14-fold increase compared to the imputed sequence. Nearly all genes (more than 97%) had at least one carrier with a LOF variant, and most genes (more than 69%) had at least ten carriers with a LOF variant. We illustrate the power of characterizing LOF variants in this population through association analyses across 1,730 phenotypes. In addition to replicating established associations, we found novel LOF variants with large effects on disease traits, including PIEZO1 on varicose veins, COL6A1 on corneal resistance, MEPE on bone density, and IQGAP2 and GMPR on blood cell traits. We further demonstrate the value of exome sequencing by surveying the prevalence of pathogenic variants of clinical importance, and show that 2% of this population has a medically actionable variant. Furthermore, we characterize the penetrance of cancer in carriers of pathogenic BRCA1 and BRCA2 variants. Exome sequences from the first 49,960 participants highlight the promise of genome sequencing in large population-based studies and are now accessible to the scientific community.
SUMMARY: The UK Biobank is a prospective study of 502,543 individuals, combining extensive phenotypic and genotypic data with streamlined access for researchers around the world. Here we describe the first tranche of large-scale exome sequence data for 49,960 study participants, revealing approximately 4 million coding variants (of which ~98.4% have frequency < 1%). The data include 231,631 predicted loss of function variants, a >10-fold increase compared to imputed sequence for the same participants. Nearly all genes (>97%) had ≥1 predicted loss of function carrier, and most genes (>69%) had ≥10 loss of function carriers. We illustrate the power of characterizing loss of function variation in this large population through association analyses across 1,741 phenotypes. In addition to replicating a range of established associations, we discover novel loss of function variants with large effects on disease traits, including PIEZO1 on varicose veins, COL6A1 on corneal resistance, MEPE on bone density, and IQGAP2 and GMPR on blood cell traits. We further demonstrate the value of exome sequencing by surveying the prevalence of pathogenic variants of clinical significance in this population, finding that 2% of the population has a medically actionable variant. Additionally, we leverage the phenotypic data to characterize the relationship between rare BRCA1 and BRCA2 pathogenic variants and cancer risk. Exomes from the first 49,960 participants are now made accessible to the scientific community and highlight the promise offered by genomic sequencing in large-scale population-based studies.
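The per-gene carrier statistics quoted above (the share of genes with at least one, or at least ten, loss-of-function carriers) can be illustrated with a small sketch. This is not the authors' pipeline: the function name, the record layout of (gene, participant ID) pairs, and the toy gene names are all assumptions made for illustration.

```python
from collections import defaultdict


def fraction_of_genes_with_carriers(carrier_records, all_genes, min_carriers):
    """Fraction of genes in all_genes with >= min_carriers distinct LOF carriers.

    carrier_records: iterable of (gene, participant_id) pairs (hypothetical layout).
    """
    carriers_by_gene = defaultdict(set)
    for gene, participant_id in carrier_records:
        # A set de-duplicates participants carrying several LOF variants in one gene.
        carriers_by_gene[gene].add(participant_id)
    n_passing = sum(
        1 for gene in all_genes
        if len(carriers_by_gene.get(gene, ())) >= min_carriers
    )
    return n_passing / len(all_genes)


# Toy data with made-up gene names and participant IDs.
records = [("GENE_A", 1), ("GENE_A", 2), ("GENE_A", 2), ("GENE_B", 3)]
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

print(fraction_of_genes_with_carriers(records, genes, 1))  # 0.5
print(fraction_of_genes_with_carriers(records, genes, 2))  # 0.25
```

Counting distinct participants rather than variant calls is a design choice in this sketch; the abstract does not say how carriers were tallied.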
Transcription regulation in eukaryotes is known to occur through the coordinated action of multiple transcription factors (TFs). Recently, a few genome-wide transcription studies have begun to explore the combinatorial nature of TF interactions. We propose a novel approach that reveals how multiple TFs cooperate to regulate transcription in the yeast cell cycle. Our method integrates genome-wide gene expression data and chromatin immunoprecipitation (ChIP-chip) data to discover more biologically relevant synergistic interactions between different TFs and their target genes than previous studies. Given any pair of TFs A and B, we define a novel measure of cooperativity between the two TFs based on the expression patterns of sets of target genes of only A, only B, and both A and B. If the cooperativity measure is significant then there is reason to postulate that the presence of both TFs is needed to influence gene expression. Our results indicate that many cooperative TFs that were previously characterized experimentally indeed have high values of cooperativity measures in our analysis. In addition, we propose several novel, experimentally testable predictions of cooperative TFs that play a role in the cell cycle and other biological processes. Many of them hold interesting clues for cross talk between the cell cycle and other processes including metabolism, stress response and pseudohyphal differentiation. Finally, we have created a web tool where researchers can explore the exhaustive list of cooperative TFs and survey the graphical representation of the target genes' expression profiles. The interface includes a tool to dynamically draw a TF cooperativity network of 113 TFs with user-defined significance levels. This study is an example of how systematic combination of diverse data types along with new functional genomic approaches can provide a rigorous platform to map TF interactions more efficiently.
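The abstract does not give the formula for its cooperativity measure, so the following is only a toy stand-in that uses the same three gene sets (targets of only A, only B, and both A and B). It assumes, purely for illustration, that "expression pattern" is summarized by the mean pairwise Pearson correlation within a set, and that cooperativity is the excess coherence of the shared targets over the better of the two single-TF sets. The function names and gene labels are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coherence(expr, genes):
    """Mean pairwise correlation among the expression profiles of a gene set."""
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        raise ValueError("need at least two genes in the set")
    return sum(pearson(expr[g], expr[h]) for g, h in pairs) / len(pairs)


def toy_cooperativity(expr, targets_a, targets_b):
    """Assumed score: coherence(both) minus the better single-TF coherence."""
    both = targets_a & targets_b
    only_a = targets_a - targets_b
    only_b = targets_b - targets_a
    return coherence(expr, both) - max(coherence(expr, only_a),
                                       coherence(expr, only_b))


# Made-up expression profiles over four time points.
expr = {
    "g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8],   # bound by both TFs
    "g3": [1, 2, 3, 4], "g4": [4, 3, 2, 1],   # bound by A only
    "g5": [1, 3, 2, 4], "g6": [4, 2, 3, 1],   # bound by B only
}
score = toy_cooperativity(expr, {"g1", "g2", "g3", "g4"}, {"g1", "g2", "g5", "g6"})
print(score)
```

A real analysis would also need a significance test for the score, which the abstract mentions but which this sketch omits.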
Cooperativity between transcription factors is critical to gene regulation. Current computational methods do not take adequate account of this salient aspect. To address this issue, we present a computational method based on multivariate adaptive regression splines to correlate the occurrences of transcription factor binding motifs in the promoter DNA and their interactions to the logarithm of the ratio of gene expression levels. This allows us to discover both the individual motifs and synergistic pairs of motifs that are most likely to be functional, and enumerate their relative contributions at any arbitrary time point for which mRNA expression data are available. We present results of simulations and focus specifically on the yeast cell-cycle data. Inclusion of synergistic interactions can increase the prediction accuracy over linear regression by as much as 1.5- to 3.5-fold. Significant motifs and combinations of motifs are appropriately predicted at each stage of the cell cycle. We believe our multivariate adaptive regression splines-based approach will become more significant when applied to higher eukaryotes, especially mammals, where cooperative control of gene regulation is absolutely essential.
Keywords: cooperativity | correlation | expression data | transcription regulation
Regulation of gene transcription in eukaryotes is complex and is inherently combinatorial in nature (1, 2). Transcriptional synergy is a key element of such combinatorial control in gene regulation networks. It requires cooperative binding of multiple transcription factors (TFs) and is intrinsically nonlinear in nature (2). Taking adequate account of such synergy in computational models is extremely important to have an accurate view of the underlying biology.
Conventional computational methods (3) have focused on identifying motifs upstream of the clusters of coexpressed genes. However, many genes fail to cluster and, therefore, regulatory elements of a large number of genes are unknown.
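To make the regression-spline idea from the abstract concrete, here is a minimal sketch of the functional form of a multivariate adaptive regression splines (MARS) model: a sum of hinge functions of motif counts plus products of hinges for motif pairs. It only evaluates a model; it does not perform the forward/backward term selection that a MARS fit involves. The motif names, knots, and coefficients are invented for illustration and are not taken from the paper.

```python
def hinge(x, knot):
    """MARS hinge basis function max(0, x - knot)."""
    return max(0.0, x - knot)


def predict_log_ratio(counts, intercept, main_terms, pair_terms):
    """Evaluate a MARS-style model of the log expression ratio.

    counts:     dict mapping motif name -> occurrences in a promoter
    main_terms: list of (motif, knot, coefficient)
    pair_terms: list of ((motif1, knot1), (motif2, knot2), coefficient)
    """
    y = intercept
    for motif, knot, coef in main_terms:
        y += coef * hinge(counts[motif], knot)
    for (m1, k1), (m2, k2), coef in pair_terms:
        # The product of two hinges is nonzero only when both motifs are present
        # above their knots, which is how a synergistic pair is represented.
        y += coef * hinge(counts[m1], k1) * hinge(counts[m2], k2)
    return y


# Hypothetical model: two motifs with small individual effects and a larger
# synergistic effect when both occur in the same promoter.
main_terms = [("motif_X", 0, 0.1), ("motif_Y", 0, 0.2)]
pair_terms = [(("motif_X", 0), ("motif_Y", 0), 0.5)]

x_only = predict_log_ratio({"motif_X": 2, "motif_Y": 0}, 0.0, main_terms, pair_terms)
x_and_y = predict_log_ratio({"motif_X": 2, "motif_Y": 1}, 0.0, main_terms, pair_terms)
print(x_only, x_and_y)
```

In this toy model the promoter with both motifs is predicted to change far more than the sum of the individual contributions, which is the kind of nonlinearity a purely additive linear regression on motif counts cannot express.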
Recent work (4, 5) has attempted to overcome this problem by correlating the frequency of DNA motifs with the logarithm of expression levels by using multivariate linear regression. Despite the success in identifying many known important motifs, this method does not account for the synergistic effects and nonlinearities present during transcription regulation. When applied to the yeast cell-cycle data, we found that these methods can explain only 10% of the variations in the data on average (noise level accounts for ≈50%; ref. 4).
More recently, models that account for cooperativity between TFs during transcription regulation have been developed (6-10). However, all of these models are limited by one or more of the following factors. Some of these methods (6-8), like the expression coherence (EC) score approach (6, 7), require data from multiple time points, which are not always available. Methods based on regression trees (8), on the other hand, cannot take proper account of additive effects. In other cases (9, 10), we found either the known pairs of motifs ...