Abstract.-Advances in high-throughput sequencing techniques now allow relatively easy and affordable sequencing of large portions of the genome, even for non-model organisms. Many phylogenetic studies prefer to reduce costs by focusing their sequencing efforts on a selected set of targeted loci, commonly enriched using sequence capture. The advantage of this approach is that it recovers a consistent set of loci, each with high sequencing depth, which leads to more confidence in the assembly of target sequences. High sequencing depth can also be used to identify phylogenetically informative allelic variation within sequenced individuals, but allele sequences are infrequently assembled in phylogenetic studies. Instead, many scientists perform their phylogenetic analyses using contig sequences, which result from the de novo assembly of sequencing reads into contigs containing only canonical nucleobases, and this may reduce both statistical power and phylogenetic accuracy. Here, we develop an easy-to-use pipeline to recover allele sequences from sequence capture data, and we use simulated and empirical data to demonstrate the utility of integrating these allele sequences into analyses performed under the Multispecies Coalescent (MSC) model.
Our empirical analyses of Ultraconserved Element (UCE) locus data collected from the South American hummingbird genus Topaza demonstrate that phased allele sequences carry sufficient phylogenetic information to infer the genetic structure, lineage divergence, and biogeographic history of a genus that diversified during the last three million years, support the recognition of two species, and suggest a high rate of gene flow across large distances of rainforest habitats but rare admixture across the Amazon River. Our simulations show that analyzing allele sequences leads to more accurate estimates of tree topology and divergence times than the more common approach of using contig sequences. We conclude that allele phasing may be the most appropriate processing scheme for phylogenetic analyses of UCE data in particular, and sequence capture data more generally.

[...] (Fig. 4). Hereafter, we use "contigs" and "contig sequences" to refer to the sequences that are output by de novo assemblers.
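To make the distinction between contig sequences and allele sequences concrete, the following minimal Python sketch compares two phased allele sequences from one diploid individual. It is an illustration only and is not the pipeline described in this paper: the toy sequences and the function names `iupac_consensus` and `heterozygous_sites` are assumptions introduced here. It shows that a heterozygous site can be kept either as two separate allele sequences or as a single sequence with IUPAC ambiguity codes, whereas a sequence restricted to canonical nucleobases must report only one base at each such site.

```python
# Illustrative sketch only: toy sequences, not data or code from the study.

# Two-base IUPAC ambiguity codes, keyed by the unordered pair of bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def heterozygous_sites(allele0, allele1):
    """Return 0-based positions where two aligned allele sequences differ."""
    return [i for i, (a, b) in enumerate(zip(allele0, allele1)) if a != b]


def iupac_consensus(allele0, allele1):
    """Merge two aligned allele sequences into one sequence, writing an
    IUPAC ambiguity code at each heterozygous site ("N" if no code applies)."""
    if len(allele0) != len(allele1):
        raise ValueError("allele sequences must be aligned to the same length")
    merged = []
    for a, b in zip(allele0.upper(), allele1.upper()):
        merged.append(a if a == b else IUPAC.get(frozenset((a, b)), "N"))
    return "".join(merged)


# Hypothetical phased alleles of one individual at one locus.
allele0 = "ACGTTAGC"
allele1 = "ACATTAGT"

print(heterozygous_sites(allele0, allele1))  # positions that differ
print(iupac_consensus(allele0, allele1))     # single ambiguity-coded sequence
```

In this toy case the two alleles differ at two positions. A contig limited to A, C, G, and T would show one base at each of them, so the information about which variants occur together on the same allele would not be recoverable from that sequence alone.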
One alternative approach to generating contig sequences uses the depth of [...] estimation of gene trees, species trees, and divergence times (Garrick et al. 2010; Potts et al. 2014; Lischer et al. 2014). The common practice of neglecting allelic information in phylogenetic studies possibly results from historical inertia and a lack of computational pipelines to prepare allele sequences for phylogenetic analysis using MPS data.
bioRxiv preprint first posted online Jan. 29, 2018; doi: http://dx.doi.org/10.1101/255752. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
In addition to the problems of determining allelic se...