2020
DOI: 10.1371/journal.pone.0228070

The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances

Abstract: We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e., the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, su…
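To make the quantity N_k concrete, the following is a minimal sketch of how the number of length-k word matches between two sequences could be counted for several values of k. It assumes a match is any pair of positions (i, j) at which the two sequences share the same k-mer, and it considers the forward strand only. The function names `count_kmer_matches` and `match_profile` are hypothetical and are not taken from the paper. The function F, the slope-based distance estimate, and the bounds k_min and k_max are not reproduced here, because the truncated abstract does not define them.

```python
from collections import Counter


def count_kmer_matches(seq_a: str, seq_b: str, k: int) -> int:
    """Count pairs of positions (i, j) with seq_a[i:i+k] == seq_b[j:j+k].

    Assumption: forward strand only, no reverse complement.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Tally every k-mer occurrence in each sequence; an empty range
    # (k longer than the sequence) simply yields an empty Counter.
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    # A k-mer seen n times in seq_a and m times in seq_b contributes n * m pairs.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items() if kmer in counts_b)


def match_profile(seq_a: str, seq_b: str, k_values) -> dict:
    """Return {k: N_k} for each k in k_values (hypothetical helper)."""
    return {k: count_kmer_matches(seq_a, seq_b, k) for k in k_values}


if __name__ == "__main__":
    a = "ACGTACGT"
    b = "ACGTTCGT"
    print(match_profile(a, b, range(1, 6)))
```

Counting through k-mer multiplicities avoids comparing every pair of positions directly, which matters for genome-scale inputs. A profile of N_k over a range of k is the kind of input from which a slope could then be estimated.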

Cited by 41 publications (29 citation statements)
References 63 publications
“…From Figure 5, it can be seen that several other popular methods have RF distances very close to K-Phylo. It should be noted that results on this dataset are surprising as methods performing well on the rest of the datasets performed poorly here and vice versa as claimed in [15]. The tree estimated by K-Phylo on this dataset and the benchmark tree are available in Figure S1 of Supplementary Data.…”
Section: Yersinia Strains (mentioning)
confidence: 70%
“…The limitation of this selection process is that it is solely dependent on sequence length and does not take into account resemblances between sequences. Another mechanism in [15] explains a method of finding a range of feasible values of k as a Figure 2: First, different k-mers are listed from the input sequences. Then separate binary matrices from all these k-mer counts are produced.…”
Section: Finding An Appropriate K-mer Length (mentioning)
confidence: 99%
“…One of the fundamental tasks of genomics is to compare these sequences for phylogenetic analysis. Several methods are available to compare genetic sequences, either through sequence alignment [1,2] or alignment-free approach [3,4,5,6,7]. Due to large size of genomic data, the sequence alignment approaches are not time and memory efficient and also have some shortcomings [8,9].…”
Section: Introduction (mentioning)
confidence: 99%
“…In recent years, a large number of alignment-free approaches to phylogeny reconstruction have been developed and applied, since these methods are much faster than traditional, alignment-based phylogenetic methods, see [50,39,3,25] for recent review papers. Most alignment-free approaches are based on k-mer statistics [21,44,7,48,17], but there are also approaches based on the length of common substrings [47,8,27,37,32,46], on word or spaced-word matches [38,33,35,34,1,41] or on so-called micro-alignments [49,20,29,28]. As has been mentioned by various authors, an additional advantage of many alignment-free methods is that they can be applied not only to complete genome sequences, but also to unassembled reads.…”
Section: Introduction (mentioning)
confidence: 99%