Biographical note

Guanjing Hu (https://orcid.org/0000-0001-8552-7394) is a researcher in evolutionary genomics, functional genomics, and whole-genome duplication (polyploidization).
Corrinne E. Grover (https://orcid.org/0000-0003-3878-5459) is a researcher in evolutionary genomics, molecular evolution, and plant systematics and evolution.
Mark A. Arick II (https://orcid.org/0000-0002-7207-3052) is a researcher in genomics, bioinformatics, and computational sciences.
Meiling Liu (https://orcid.org/0000-0001-7953-1506) is a researcher in statistics and bioinformatics.
Daniel G. Peterson (https://orcid.org/0000-0002-0274-5968) is a researcher in genomics, biocomputing, and biotechnology.
Jonathan F. Wendel (https://orcid.org/0000-0003-2258-5081) is a researcher in evolutionary genomics, molecular evolution, and plant systematics and evolution.
ABSTRACT

Polyploidy is a widespread phenomenon throughout eukaryotes. Because duplicated genomes coexist in the same nucleus, polyploids pose unique challenges for estimating gene expression levels, a task that is essential for understanding the extensive and varied transcriptomic responses that accompany polyploidy. Although previous studies have explored the bioinformatics of polyploid transcriptomic profiling, the causes and consequences of inaccurate quantification of transcripts from duplicated gene copies have not been addressed. Using transcriptomic data from the cotton genus (Gossypium) as an example, we present an analytical workflow to evaluate a variety of bioinformatic method choices at different stages of RNA-seq analysis, from homoeolog expression quantification to the downstream analyses used to infer key phenomena of polyploid expression evolution. In general, GSNAP-PolyCat outperforms the other quantification pipelines tested, and its derived expression dataset best represents the expected homoeolog expression and co-expression divergence. The performance of co-expression network analysis was affected less by homoeolog quantification than by the network construction method, with weighted networks outperforming binary networks. By examining the extent and consequences of homoeolog read ambiguity, we illuminate potential artifacts that may affect our understanding of duplicate gene expression, including an overestimation of homoeolog coregulation and the incorrect inference of subgenome asymmetry in network topology. Taken together, our work points to a set of reasonable practices that we hope are broadly applicable to the evolutionary exploration of polyploids.
RNA-seq read mapping and homoeolog-specific read partitioning

The following five pipelines were each independently applied to the diploid and AD polyploid datasets.

GSNAP-PolyCat. This pipeline utilizes the SNP-tolerant capabilities of GSNAP [v2016-08-16] [29] to map polyploid reads to a single diploid progenitor genome (here, G. raimondii; [30]). The SNP-tolerant feature of GSNAP permits equivocal mapping of both A- and D-diploid derived reads based on a priori SNP information. Here, we used a previousl...