We study the number N k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences-i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor-can be estimated from the slope of a function F that depends on N k and that is affine-linear within a certain range of k. Integers k min and k max can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k min ) and F(k max ). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies. OPEN ACCESSCitation: Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B (2020) The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS ONE 15(2): e0228070. https://doi.org/10. Data Availability Statement:The source code of our software is freely available through GitHub Traditionally, phylogenetic distances are inferred from pairwise or multiple sequence alignments. For the huge amounts of sequence data that are now available, however, sequence alignment has become too slow. Therefore, considerable efforts have been made in recent years, to develop fast alignment-free approaches that can estimate phylogenetic distances without the need to calculate full alignments of the input sequences, see [3][4][5][6][7] for recent review articles. Alignment-free approaches are not only used in phylogeny reconstruction, but are also important in metagenomics [8][9][10], to find genome rearrangements [11] and in epidemiology [12] and other medical applications, for example to identify drug-resistant bacteria [13] or to classify viruses [14,15]. In all these applications, it is crucial to rapidly estimate pairwise similarity or dissimilarity values in large sets of sequence data.Some alignment-free approaches are based on word frequencies [16,17] or on the length of common substrings [18][19][20]. Other methods use variants of the D 2 distance which is defined as the number of word matches of a pre-defined length between two sequences [15,[21][22][23]; a review focusing on these methods is given in [24]. kWIP [25] is a further development of this concept that uses information-theoretical...
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.