Obstacles to inferring species trees from whole genome data sets range from algorithmic
and data management challenges to the wholesale discordance in evolutionary history found
in different parts of a genome. Recent work that builds trees directly from genomes by
parsing them into sets of small $k$-mer strings holds promise
to streamline and simplify these efforts, but existing approaches do not account well for
gene tree discordance. We describe a “seed and extend” protocol that finds nearly exact
matching sets of orthologous $k$-mers and extends them to construct data
sets that can properly account for genomic heterogeneity. Because the protocol exploits an
efficient suffix array data structure, sets of whole genomes can be parsed and converted into phylogenetic
data matrices rapidly, with contiguous blocks of $k$-mers
from the same chromosome, gene, or scaffold concatenated as needed. Analyses of
highly curated rice genome data and a diverse set of six other eukaryotic
whole genome, transcriptome, and organellar genome data sets recovered trees nearly
identical to those from published phylogenomic analyses, in a small fraction of the time and
with many fewer parameter choices. Our method’s ability to retain local homology
information was demonstrated by using it to characterize gene tree discordance across the
rice genome, and by its robustness to the high rate of interchromosomal gene transfer
found in several rice species.
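
The general idea of seeding on shared $k$-mers located through a suffix array, then extending them into alignment blocks, can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made for illustration: the function names (`suffix_array`, `unique_kmers`, `shared_seeds`, `extend_seed`) are hypothetical, the suffix array is built naively by sorting, seeds are required to match exactly and to occur once per genome, extension is ungapped with a fixed `flank`, and reverse complements are ignored.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def unique_kmers(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position.

    Suffixes sharing the same first k characters are adjacent in the suffix
    array, so a k-mer is unique when neither neighbour has the same prefix.
    """
    # Drop suffixes too short to contain a full k-mer; order is preserved.
    sa = [i for i in suffix_array(seq) if i + k <= len(seq)]
    kmers = [seq[i:i + k] for i in sa]
    unique = {}
    for j, kmer in enumerate(kmers):
        prev_same = j > 0 and kmers[j - 1] == kmer
        next_same = j + 1 < len(kmers) and kmers[j + 1] == kmer
        if not prev_same and not next_same:
            unique[kmer] = sa[j]
    return unique


def shared_seeds(genomes, k):
    """Seeds (assumed definition): k-mers found exactly once in every genome."""
    per_genome = {name: unique_kmers(seq, k) for name, seq in genomes.items()}
    common = set.intersection(*(set(d) for d in per_genome.values()))
    return {
        kmer: {name: per_genome[name][kmer] for name in genomes}
        for kmer in sorted(common)
    }


def extend_seed(genomes, positions, k, flank):
    """Ungapped extension (an assumption): add `flank` bases on each side.

    Returns one equal-length row per genome, or None if any genome lacks
    room for the full window.
    """
    rows = {}
    for name, pos in positions.items():
        start, end = pos - flank, pos + k + flank
        if start < 0 or end > len(genomes[name]):
            return None
        rows[name] = genomes[name][start:end]
    return rows


if __name__ == "__main__":
    toy = {"x": "GGACGTCC", "y": "TTACGTAA"}  # made-up sequences
    seeds = shared_seeds(toy, k=4)
    for kmer, positions in seeds.items():
        print(kmer, positions, extend_seed(toy, positions, k=4, flank=2))
```

Because each extended block keeps the positions of its seed, blocks from the same chromosome, gene, or scaffold could be concatenated or analysed separately. A practical implementation would replace the naive sort with a linear-time or library suffix array construction.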