The ability to predict traits from genome-wide sequence information (Genomic Prediction, GP) has improved our understanding of the genetic basis of complex traits and [...] for trait prediction. However, GP-based approaches trained on entire transcriptome data have not been used to better understand the genetic mechanisms underlying a trait. In addition, the degree to which transcriptomes obtained at a particular developmental stage can be informative for predicting phenotypes scored at a different stage is not known. To address these questions, we used transcriptome data derived from maize whole seedlings [22] to predict phenotypes (flowering time, height, and grain yield) at much later developmental stages. In addition to comparing prediction performance between genetic marker-based and transcriptome-based models, we also examined whether transcripts and genetic marker features important for the prediction models were located in the same or adjacent regions. Finally, we determined how well our models were able to identify a benchmark set of flowering time genes to explore the potential of using GP to better understand the mechanistic basis of complex traits.
Results and Discussion
Relationships between transcript levels, kinship, and phenotypes among maize lines
Before using the transcriptome data for GP, we first assessed properties of the transcriptome data in three areas: (1) the quantity and distribution of transcript information across the genome, (2) the amount of variation in transcript levels, and (3) the similarity in the transcriptome profile between maize lines, with an emphasis on how these properties compared to those based on the genotype data. After filtering out 16,898 transcripts that did not map to the B73 reference genome or had zero or near-zero variance across lines (see Methods), we had 31,238 transcripts. While the number of transcripts was <10% of the number of genetic markers used in this study (332,178), the distribution of transcripts along the genome was similar to the genetic marker distribution (Fig. S1). The log2-transformed median transcript level across lines ranged from 0 to 12.4 (median = 2.2) and the variance ranged from 3 × 10^-30 to 14.5 (median = 0.13), highlighting that a subset of transcripts had relatively high variation in transcript levels across maize lines at the seedling stage. To determine how similar transcript levels were between lines, we calculated the expression Correlation (eCor) between all pairs of lines using Pearson's Correlation Coefficient (PCC). The eCor values ranged from 0.84 to 0.99 (mean = 0.93). As expected, lines with similar transcriptome profiles were also genetically similar, as there was a significant correlation between eCor values and values in the kinship matrix generated from the genetic marker data (Spearman's Rank ρ = 0.27, p < 2.2 × 10^-16; Fig. 1A). As a result, we were able to find clust...
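The eCor calculation and its comparison with kinship can be illustrated with a minimal sketch. It is not the pipeline used in this study: the data are synthetic, the function names (`pearson`, `average_ranks`, `spearman`, `ecor_matrix`, `upper_triangle`) are hypothetical, and the toy kinship is assumed to be a simple marker cross-product. It computes only the Spearman coefficient, not a p-value.

```python
import math
import random


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def ecor_matrix(expr):
    """Pairwise PCC between lines; expr[i] holds transcript levels of line i."""
    n = len(expr)
    return [[pearson(expr[i], expr[j]) for j in range(n)] for i in range(n)]


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, so each pair of lines counts once."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


# Synthetic toy data (assumption): 8 lines, 50 biallelic markers, 40 transcripts.
random.seed(0)
n_lines, n_markers, n_transcripts = 8, 50, 40
markers = [[random.choice((-1, 1)) for _ in range(n_markers)]
           for _ in range(n_lines)]
# Toy kinship: scaled marker cross-product between lines.
kinship = [[sum(a * b for a, b in zip(mi, mj)) / n_markers for mj in markers]
           for mi in markers]
# Toy log2 transcript levels: shared baseline, a marker effect, and noise.
base = [random.uniform(0, 12) for _ in range(n_transcripts)]
expr = [[base[t] + 0.3 * markers[i][t] + random.gauss(0, 0.2)
         for t in range(n_transcripts)]
        for i in range(n_lines)]

ecor = ecor_matrix(expr)
rho = spearman(upper_triangle(ecor), upper_triangle(kinship))
print(f"Spearman's rho between eCor and kinship: {rho:.2f}")
```

Only the upper triangle of each matrix is compared, so that the diagonal (which is 1 for eCor by construction) and the duplicated symmetric entries do not inflate the correlation.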