European Journal of Taxonomy 283: 1-25 (2017)

Abstract. In the mid-2000s, molecular phylogenetics turned into phylogenomics, a development that improved the resolution of phylogenetic trees through a dramatic reduction in stochastic error. While some then predicted "the end of incongruence", it soon appeared that analysing large amounts of sequence data without an adequate model of sequence evolution amplifies systematic error and leads to phylogenetic artefacts. With the increasing flood of (sometimes low-quality) genomic data resulting from the rise of high-throughput sequencing, a new type of error has emerged. Termed here "data errors", it lumps together several kinds of issues affecting the construction of phylogenomic supermatrices (e.g., sequencing and annotation errors, contaminant sequences). While easy to deal with at a single-gene scale, such errors become very difficult to avoid at the genomic scale, both because hand curating thousands of sequences is prohibitively time-consuming and because the suitable automated bioinformatics tools are still in their infancy. In this paper, we first review the pitfalls affecting the construction of supermatrices and the strategies to limit their adverse effects on phylogenomic inference. Then, after discussing the relative non-issue of missing data in supermatrices, we briefly present the approaches commonly used to reduce systematic error.

Keywords. Phylogenomics, supermatrix, systematic error, data quality, incongruence.
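To make concrete what the abstract means by a supermatrix of concatenated genes, and why missing data arises in one, here is a minimal illustrative sketch (not from the paper; the taxon names, gene alignments and the `build_supermatrix` helper are hypothetical). Per-gene alignments are concatenated taxon by taxon, and taxa absent from a gene are padded with "?" characters, which is the usual representation of missing data:

```python
def build_supermatrix(gene_alignments):
    """Concatenate per-gene alignments (dicts of taxon -> aligned sequence)
    into one supermatrix; taxa absent from a gene are padded with '?'."""
    taxa = sorted({t for aln in gene_alignments for t in aln})
    rows = {t: [] for t in taxa}
    for aln in gene_alignments:
        gene_len = len(next(iter(aln.values())))  # all sequences in an alignment share one length
        for t in taxa:
            # Missing taxon for this gene: pad with '?' so every row stays equal in length.
            rows[t].append(aln.get(t, "?" * gene_len))
    return {t: "".join(parts) for t, parts in rows.items()}

# Two toy gene alignments; "Mus" was not sequenced for gene 2.
genes = [
    {"Homo": "ATG", "Mus": "ATA"},
    {"Homo": "CCGT", "Danio": "CAGT"},
]
sm = build_supermatrix(genes)
# sm["Mus"] == "ATA????"  (gene 2 is missing data for Mus)
```

The '?' cells are exactly the "missing data in supermatrices" the abstract refers to; as taxon and gene sampling grow independently, such cells become hard to avoid.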
From phylogenetics to phylogenomics

The last two decades have seen significant changes in the practice of phylogenetic inference from molecular data. These changes have mostly been triggered by the ever-increasing size of the datasets being assembled and analysed in the form of supermatrices of concatenated genes (Driskell et al. 2004). The evolution of the size and shape of these datasets, driven by technological advances in DNA amplification and sequencing, has been associated with different phases in the field of phylogenetic inference (Fig. 1). In the early days of molecular phylogenetics, based on PCR amplification and manual Sanger sequencing of a handful of genes for a limited number of taxa, the field lay somewhat in the "uncertainty zone" dominated by stochastic error, even if some naïve over-interpretations did occur. Then, the development of automated Sanger sequencing typically led to a few standard genes (e.g., SSU rRNA, elongation factors, RNA polymerases) being sequenced for a large number of taxa, projecting the field into the "irresolution zone", with only limited information available to resolve a large number of nodes. Yet, at that time the focus was not on resolving large-scale phylogenetic relationships but rather on identifying species or strains by molecular means, which eventually gave rise to the field of DNA barcoding (Hebert et al. 2003). The first genome sequences from model organisms reversed the situation, with few taxa being sequenced for their entire set of genes, causing a sort of "inconsis...