Publicly available next-generation sequencing datasets of non-model organisms, such as marine protists, offer opportunities to explore their evolutionary relationships. In this study we explored the effects that dataset and model selection have on the phylogenetic inference of the Gonyaulacales, single-celled marine algae of the phylum Dinoflagellata with genomes that show extensive paralogy. We developed a method for identifying and extracting single-copy genes from RNA-seq libraries and compared phylogenies inferred from these single-copy genes with those inferred from commonly used genetic markers and phylogenetic methods. Comparison of two datasets and three different phylogenetic models showed that exclusive use of ribosomal DNA sequences, maximum likelihood and gene concatenation gave results very different from those obtained with the multi-species coalescent. The multi-species coalescent has recently been recognized as being robust to the inclusion of paralogs, including hidden paralogs present in single-copy gene sets (pseudoorthologs). Comparisons of model fit strongly favored the multi-species coalescent for these data over a concatenated alignment (single tree) model. Our findings suggest that the multi-species coalescent (inferred via either maximum likelihood or Bayesian inference) should be considered for future phylogenetic studies of organisms where accurate selection of orthologs is difficult.
Factors impacting phylogenetic studies range from the computational methods and the availability of compute infrastructure, to the methods and models applied to the data, to the accuracy of the initial genetic dataset itself. Furthermore, the practitioners themselves need to have a solid understanding of the methods, including their shortcomings.
An example of the breadth of publicly available data is the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP), which provides transcriptome sequences of over 650 marine eukaryotic microbes (Keeling et al., 2014). The MMETSP focuses on a group of understudied organisms that are abundant and play vital roles in the marine environment, from geochemical cycling, to predation, to symbiosis (Gómez, 2005, 2012). This dataset offers an excellent opportunity to explore the evolutionary relationships between these taxa through phylogenetics.
Central to phylogenetic inference is the existence of characters (such as nucleotides) derived from a common ancestor, which is called homology (Fitch, 2000). There are several types of homology, each differing in how the characters diverged, and determining the mechanisms through which characters have evolved is essential for choosing the correct inference model. Orthology refers to the case where the divergence of two gene copies has followed a speciation event (Fitch, 1970). Paralogs are two gene copies whose divergence is initiated by gene duplication (Fitch, 1970). Xenologs are genes which, having previously diverged from a common ancest...