2017
DOI: 10.1089/cmb.2015.0216

Statistically Consistent k-mer Methods for Phylogenetic Tree Reconstruction

Abstract: Frequencies of k-mers in sequences are sometimes used as a basis for inferring phylogenetic trees without first obtaining a multiple sequence alignment. We show that a standard approach of using the squared Euclidean distance between k-mer vectors to approximate a tree metric can be statistically inconsistent. To remedy this, we derive model-based distance corrections for orthologous sequences without gaps, which lead to consistent tree inference. The identifiability of model parameters from k-mer frequencies …
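As a rough illustration of the "standard approach" the abstract refers to, the sketch below builds normalized k-mer frequency vectors and takes the squared Euclidean distance between them. The function names `kmer_frequencies` and `squared_euclidean` and the DNA alphabet default are assumptions made for this example, not taken from the paper. This is the uncorrected distance that the abstract says can be statistically inconsistent; the paper's model-based corrections are not reproduced here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of seq.

    Entries follow the lexicographic order of all k-mers over `alphabet`
    (hypothetical helper; windows with other characters are ignored).
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


if __name__ == "__main__":
    x = kmer_frequencies("ACGTACGTAC", 2)
    y = kmer_frequencies("ACGTTCGTAC", 2)
    print(squared_euclidean(x, y))
```

A matrix of such pairwise values is what a distance method would take as input; whether the resulting tree is consistent depends on the correction applied, which is the paper's subject.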

Cited by 24 publications (19 citation statements)
References 22 publications
“…In the simplest case, each pairwise measure can be transformed into a distance, and a matrix of such distances used as input for computing a distance tree, e.g. by neighbour-joining 20 21 22 . Evidence is accumulating that in phylogenetic inference per se , these alignment-free methods can offer acceptable performance – in certain cases better than approaches based on multiple sequence alignment – at much greater computational speed and scalability 19 .…”
mentioning
confidence: 99%
“…An inner product between k -mer counts has long been used to detect and measure sequence similarity, and is referred to as the D 2 statistic. There have been many derivatives of the D 2 statistic that seek to enhance its accuracy in recreating evolutionary histories (e.g., , and [ 20 22 ]). does not attempt to re-create evolutionary histories, but rather estimates the similarity of genetic material as it exists today.…”
Section: Discussion
mentioning
confidence: 99%
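The citation statement above describes the D2 statistic as an inner product between k-mer counts. A minimal sketch of that definition follows; the helper names `kmer_counts` and `d2` are assumptions made for this example, and none of the derivative statistics mentioned in the quote are implemented.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k window of seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute a nonzero term;
    # a Counter returns 0 for missing keys.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGTTCGT", 2))
```

Note that D2 is a similarity, not a distance: it grows with shared k-mers and with sequence length, so it needs a transformation before it can feed a distance-based tree method.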
“…The normalized frequency of each kind of k-mer consists of the k-mer encoding. It is widely used in genome assembly, clustering and capturing nucleotide or protein sequences' features [26][27][28]. 3) PseDNC/PseKNC: compared with k-mer, PseDNC/PseKNC considered more global information and introduced the physical and chemical properties of nucleotides.…”
Section: Sequence Encoding Strategy
mentioning
confidence: 99%