Abstract
One of the major tasks of evolutionary biology is the reconstruction of phylogenetic trees from molecular data. This problem is of critical importance in almost all areas of biology and has a very clear mathematical formulation. The evolutionary model is given by a Markov chain on the true evolutionary tree. Given samples from this Markov chain at the leaves of the tree, the goal is to reconstruct the evolutionary tree. It is crucial to minimize the number of samples, i.e., the length of genetic sequences, as it is constrained by the underlying biology, the price of sequencing, etc.

It is well known that in order to reconstruct a tree on $n$ leaves, sequences of length $\Omega(\log n)$ are needed. It was conjectured by M. Steel that for the CFN evolutionary model, if the mutation probability on all edges of the tree is less than $p^* = (\sqrt{2} - 1)/2^{3/2}$, then the tree can be recovered from sequences of length $O(\log n)$. This was proven by the second author in the special case where the tree is "balanced". The second author also proved that if all edges have mutation probability larger than $p^*$, then the length needed is $n^{\Omega(1)}$. This "phase transition" in the number of samples needed is closely related to the phase transition for the reconstruction problem (or extremality of the free measure) studied extensively in statistical physics and probability.

Here we complete the proof of Steel's conjecture and give a reconstruction algorithm using optimal (up to a multiplicative constant) sequence length. Our results further extend to obtain an optimal reconstruction algorithm for the Jukes-Cantor model with short edges.
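
To make the setting above concrete, here is a minimal sketch of how leaf sequences might be sampled under the CFN model: a two-state Markov chain on a rooted tree in which each site flips independently along an edge with that edge's mutation probability. It only illustrates the generative model and the value of the threshold; it is not the reconstruction algorithm of the paper. The names `simulate_cfn`, `children`, `edge_p` and `P_STAR`, the uniform root distribution, and the four-leaf example tree are assumptions made for this illustration.

```python
import math
import random

# Threshold from the abstract: p* = (sqrt(2) - 1) / 2^(3/2), roughly 0.1464.
# Equivalently, 2 * (1 - 2 p*)^2 = 1.
P_STAR = (math.sqrt(2) - 1) / 2 ** 1.5


def simulate_cfn(children, root, edge_p, seq_len, rng):
    """Evolve +1/-1 sequences down a rooted tree under the CFN model.

    children: dict mapping each internal node to a list of its children
    edge_p:   dict mapping (parent, child) to that edge's mutation probability
    Returns a dict mapping each leaf to its sequence (a list of +1/-1 values).
    Hypothetical helper written for this illustration only.
    """
    # Assumption: every site at the root is an independent uniform +1/-1 state.
    states = {root: [rng.choice((-1, 1)) for _ in range(seq_len)]}
    leaves = {}
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # A node without children is a leaf; its sequence is observed.
            leaves[node] = states[node]
            continue
        for kid in kids:
            p = edge_p[(node, kid)]
            # Each site flips independently with probability p along the edge.
            states[kid] = [-s if rng.random() < p else s for s in states[node]]
            stack.append(kid)
    return leaves


# Example: a balanced four-leaf tree whose edges all mutate with probability 0.1,
# which lies below the threshold P_STAR.
children = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"]}
edges = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "b"), ("v", "c"), ("v", "d")]
edge_p = {edge: 0.1 for edge in edges}
leaves = simulate_cfn(children, "r", edge_p, seq_len=2000, rng=random.Random(0))
```

In this sketch `seq_len` plays the role of the number of samples: a reconstruction method would see only the sequences in `leaves` and would have to recover the tree topology from them.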
All reconstruction algorithms run in time polynomial in the sequence length. The algorithm and the proofs are based on a novel combination of combinatorial, metric and probabilistic arguments.

Keywords
optimal phylogenetic reconstruction, mutation probability, second author, Markov chain, phylogenetic tree, underlying biology, special case, statistical physics, phase transition, reconstruction problem, evolutionary tree, genetic sequence, molecular data, CFN evolutionary model, evolutionary model, clear mathematical formulation, true evolutionary tree, major task, evolutionary biology, critical importance, free measure

Disciplines
Statistics and Probability | Theory and Algorithms