SignificanceIn plants, DNA cytosine methylation plays a central role in diverse cellular functions, from transcriptional regulation to maintenance of genome integrity. Vast numbers of whole-genome bisulphite sequencing (WGBS) datasets have been generated to profile DNA methylation at single-nucleotide resolution, yet computational analyses vary widely among research groups, making it difficult to cross-compare findings. Here we reprocessed hundreds of publicly available Arabidopsis WGBS libraries using a uniform pipeline. We identified high-confidence differentially methylated regions and compared libraries using a hierarchical framework, allowing us to identify relationships between methylation pathways. Furthermore, by using a large number of independent wild-type controls, we effectively filtered out spontaneous methylation changes from those that are biologically meaningful.
Mediator 12 (MED12) and MED13 are components of the Mediator multi-protein complex, that facilitates the initial steps of gene transcription. Here, in an Arabidopsis mutant screen, we identify MED12 and MED13 as positive gene regulators, both of which contribute broadly to morc1 de-repressed gene expression. Both MED12 and MED13 are preferentially required for the expression of genes depleted in active chromatin marks, a chromatin signature shared with morc1 reactivated loci. We further discover that MED12 tends to interact with genes that are responsive to environmental stimuli, including light and radiation. We demonstrate that light-induced transient gene expression depends on MED12, and is accompanied by a concomitant increase in MED12 enrichment during induction. In contrast, the steady-state expression level of these genes show little dependence on MED12, suggesting that MED12 is primarily required to aid the expression of genes in transition from less-active to more active states.
Next-generation sequencing technology (NGS) enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our software uses the information on sequencing quality, such as sequencing depth, genotyping quality, and GC contents, to predict whether a particular variant is likely to be false-positive. To evaluate ForestQC, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that ForestQC outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. ForestQC is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is a practical approach to perform quality control on genetic variants from sequencing data.
Background Large medical centers in urban areas, like Los Angeles, care for a diverse patient population and offer the potential to study the interplay between genetic ancestry and social determinants of health. Here, we explore the implications of genetic ancestry within the University of California, Los Angeles (UCLA) ATLAS Community Health Initiative—an ancestrally diverse biobank of genomic data linked with de-identified electronic health records (EHRs) of UCLA Health patients (N=36,736). Methods We quantify the extensive continental and subcontinental genetic diversity within the ATLAS data through principal component analysis, identity-by-descent, and genetic admixture. We assess the relationship between genetically inferred ancestry (GIA) and >1500 EHR-derived phenotypes (phecodes). Finally, we demonstrate the utility of genetic data linked with EHR to perform ancestry-specific and multi-ancestry genome and phenome-wide scans across a broad set of disease phenotypes. Results We identify 5 continental-scale GIA clusters including European American (EA), African American (AA), Hispanic Latino American (HL), South Asian American (SAA) and East Asian American (EAA) individuals and 7 subcontinental GIA clusters within the EAA GIA corresponding to Chinese American, Vietnamese American, and Japanese American individuals. Although we broadly find that self-identified race/ethnicity (SIRE) is highly correlated with GIA, we still observe marked differences between the two, emphasizing that the populations defined by these two criteria are not analogous. We find a total of 259 significant associations between continental GIA and phecodes even after accounting for individuals’ SIRE, demonstrating that for some phenotypes, GIA provides information not already captured by SIRE. GWAS identifies significant associations for liver disease in the 22q13.31 locus across the HL and EAA GIA groups (HL p-value=2.32×10−16, EAA p-value=6.73×10−11). A subsequent PheWAS at the top SNP reveals significant associations with neurologic and neoplastic phenotypes specifically within the HL GIA group. Conclusions Overall, our results explore the interplay between SIRE and GIA within a disease context and underscore the utility of studying the genomes of diverse individuals through biobank-scale genotyping linked with EHR-based phenotyping.
Large medical centers located in urban areas such as Los Angeles care for a diverse patient population and offer the potential to study the interplay between genomic ancestry and social determinants of health within a single medical system. Here, we introduce the UCLA ATLAS Community Health Initiative-- a biobank of genomic data linked with de-identified electronic health records (EHRs) of UCLA Health patients. We leverage the unique genomic diversity of the patient population in ATLAS to explore the interplay between self-reported race/ethnicity and genetic ancestry within a disease context using phenotypes extracted from the EHR. First, we identify an extensive amount of continental and subcontinental genomic diversity within the ATLAS data that is consistent with the global diversity of Los Angeles; this includes clusters of ATLAS individuals corresponding to individuals with Korean, Japanese, Filipino, and Middle Eastern genomic ancestries. Most importantly, we find that common diseases and traits stratify across genomic ancestry clusters, thus suggesting their utility in understanding disease biology across diverse individuals. Next, we showcase the power of genetic data linked with EHR to perform ancestry-specific genome and phenome-wide scans to identify genetic factors for a variety of EHR-derived phenotypes (phecodes). For example, we find ancestry-specific associations for liver disease, and link the genetic variants with neurological and neoplastic phenotypes primarily within individuals of admixed ancestries. Overall, our results underscore the utility of studying the genomes of diverse individuals through biobank-scale genotyping efforts linked with EHR-based phenotyping.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.