The popularization of biobanks provides an unprecedented amount of genetic and phenotypic information that can be used to research the relationship between genetics and human health. Despite the opportunities these datasets provide, they also pose many problems associated with computational time and costs, data size and transfer, and privacy and security. The publishing of summary statistics from these biobanks, and the use of them in a variety of downstream statistical analyses, alleviates many of these logistical problems. However, major questions remain about how to use summary statistics in all but the simplest downstream applications. Here, we present a novel approach to utilize basic summary statistics (estimates from single marker regressions on single phenotypes) to evaluate more complex phenotypes using multivariate methods. In particular, we present a covariate-adjusted method for conducting principal component analysis (PCA) utilizing only biobank summary statistics. We validate exact formulas for this method, as well as provide a framework of estimation when specific summary statistics are not available, through simulation. We apply our method to a real data set of fatty acid and genomic data.
Rhizomes facilitate the wintering and vegetative propagation of many perennial grasses. Sorghum halepense (johnsongrass) is an aggressive perennial grass that relies on a robust rhizome system to persist through winters and reproduce asexually from its rootstock nodes. This study aimed to sequence and assemble expressed transcripts within the johnsongrass rhizome. A de novo transcriptome assembly was generated from a single johnsongrass rhizome meristem tissue sample. A total of 141,176 probable protein-coding sequences from the assembly were identified and assigned gene ontology terms using Blast2GO. Estimated expression analysis and BLAST results were used to reduce the assembly to 64,447 high-confidence sequences. The johnsongrass assembly was compared to Sorghum bicolor, a related nonrhizomatous species, along with an assembly of similar rhizome tissue from the perennial grain crop Thinopyrum intermedium. The presence/absence analysis yielded a set of 98 expressed johnsongrass contigs that are likely associated with rhizome development.
In the search for an understanding of how genetic variation contributes to the heritability of common human disease, the potential role of epigenetic factors, such as methylation, is being explored with increasing frequency. Although standard analyses test for associations between methylation levels at individual cytosine-phosphate-guanine (CpG) sites and phenotypes of interest, some investigators have begun testing for methylation and how methylation may modulate the effects of genetic polymorphisms on phenotypes. In our analysis, we used both a genome-wide and candidate gene approach to investigate potential single-nucleotide polymorphism (SNP)–CpG interactions on changes in triglyceride levels. Although we were able to identify numerous loci of interest when using an exploratory significance threshold, we did not identify any significant interactions using a strict genome-wide significance threshold. We were also able to identify numerous loci using the candidate gene approach, in which we focused on 18 genes with prior evidence of association of triglyceride levels. In particular, we identified GALNT2 loci as containing potential CpG sites that moderate the impact of genetic polymorphisms on triglyceride levels. Further work is needed to provide clear guidance on analytic strategies for testing SNP–CpG interactions, although leveraging prior biological understanding may be needed to improve statistical power in data sets with smaller sample sizes.
Although methylation data continues to rise in popularity, much is still unknown about how to best analyze methylation data in genome-wide analysis contexts. Given continuing interest in gene-based tests for next-generation sequencing data, we evaluated the performance of novel gene-based test statistics on simulated data from GAW20. Our analysis suggests that most of the gene-based tests are detecting real signals and maintaining the Type I error rate. The minimum p value and threshold-based tests performed well compared to single-marker tests in many cases, especially when the number of variants was relatively large with few true causal variants in the set.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.