ALFRED: A Practical Method for Alignment-Free Distance Computation

Thankachan, Sharma V.; Chockalingam, Sriram P.; Liu, Yong-Chao; Apostolico, Alberto; Aluru, Srinivas

doi:10.1089/cmb.2015.0217

Cited by 26 publications

(18 citation statements)

References 23 publications

“…Match-length approaches, in contrast, estimate phylogenetic distances from the length of substring matches between two sequences (Comin and Verzotto, 2012; Haubold et al., 2005; Thankachan et al., 2016; Ulitsky et al., 2006). Since the length of exact substring matches between two homologous sequence regions depends on the mismatch frequency, substitution rates can be estimated, in turn, from the average length of exact common substrings (Domazet-Loso and Haubold, 2009).…”

Section: Introduction (mentioning)

confidence: 99%

Fast and accurate phylogeny reconstruction using filtered spaced-word matches

2017

Motivation: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods.

Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced-word matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.

Availability and Implementation: The program source code for FSWM, including documentation, as well as the software that we used to generate artificial genome sequences, is freely available at http://fswm.gobics.de/

Supplementary information: Supplementary data are available at Bioinformatics online.
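The spaced-word idea described in this abstract can be illustrated with a small sketch. This is not the FSWM program: the function names, the toy +1/-1 scoring used for the filter, and the Jukes-Cantor correction at the end are all assumptions chosen for illustration, and the pattern convention (`1` = match position, `0` = don't-care position) is likewise assumed.

```python
import math
from collections import defaultdict

def spaced_words(seq, pattern):
    """Yield (key, start) per window; key = characters at the '1' positions."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[start + i] for i in match_pos), start

def filtered_mismatch_rate(s1, s2, pattern, min_score=0):
    """Mismatch fraction at don't-care positions over retained matches.

    Toy filter (an assumption): score = matches - mismatches at the
    don't-care positions; matches scoring below min_score are discarded.
    Returns None when no spaced-word match survives the filter.
    """
    dont_care = [i for i, c in enumerate(pattern) if c == "0"]
    index = defaultdict(list)
    for key, pos in spaced_words(s1, pattern):
        index[key].append(pos)
    mismatches = total = 0
    for key, p2 in spaced_words(s2, pattern):
        for p1 in index.get(key, ()):
            diff = sum(s1[p1 + i] != s2[p2 + i] for i in dont_care)
            score = (len(dont_care) - diff) - diff
            if score < min_score:
                continue  # treated as a spurious random match
            mismatches += diff
            total += len(dont_care)
    return mismatches / total if total else None

def jukes_cantor(p):
    """Standard Jukes-Cantor correction; infinite once p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)
```

A call such as `jukes_cantor(filtered_mismatch_rate(a, b, "1101"))` would then turn the observed mismatch fraction into a substitutions-per-site estimate, provided at least one match passes the filter.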

“…The algorithm is much more complicated than the original ACS method and even the k-ACS approximation by [14]. Moreover the practical variant of this algorithm can get quite slow for even moderately large values of k due to its exponential dependency on k [21]. However, this algorithm has its merit as the first sub-quadratic time algorithm for exact k-ACS computation for any positive integer k .…”

Section: Introduction (mentioning)

confidence: 99%

A greedy alignment-free distance estimator for phylogenetic inference

et al. 2017

Self Cite

Background: Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings.

Results: We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed.

Conclusions: ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package (http://alurulab.cc.gatech.edu/phylo).
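To make the generalized ACS quantity concrete, the following brute-force sketch computes, for every position of one sequence, the longest match in the other sequence with at most k mismatches, and averages those lengths. It is a direct reading of the definition only. It is neither the ALFRED-G heuristic nor the kmacs algorithm, and the function names are hypothetical.

```python
def longest_k_mismatch_prefix(a, i, b, j, k):
    """Longest common prefix of a[i:] and b[j:] with at most k mismatches."""
    length = 0
    while i + length < len(a) and j + length < len(b):
        if a[i + length] != b[j + length]:
            if k == 0:
                break  # mismatch budget exhausted
            k -= 1
        length += 1
    return length

def k_acs(a, b, k):
    """Average, over positions i of a, of the best k-mismatch match in b.

    Brute force over all start pairs, so the cost grows with
    len(a) * len(b) * match length; only suitable for short strings.
    """
    total = 0
    for i in range(len(a)):
        total += max(
            longest_k_mismatch_prefix(a, i, b, j, k) for j in range(len(b))
        )
    return total / len(a)
```

With k = 0 this reduces to the plain ACS value; allowing mismatches can only lengthen the matches, so `k_acs(a, b, 1) >= k_acs(a, b, 0)` for any pair of non-empty strings.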

“…For k = 0 kmacs exactly computes the ACS. Other algorithms besides kmacs [33,29] have been designed to compute alignment-free distances based on longest matches with mismatches, but for the special case k = 0 kmacs 332 Table 3. The first collection contains 932 genomes, the second one contains 4,983 genomes.…”

Section: Preliminary Experiments (mentioning)

confidence: 99%

“…To keep pace with this, several algorithms that go beyond the concept of sequence alignment have been developed, called alignment-free [35]. Alignment-free approaches have been explored in several large-scale biological applications ranging, for instance, from DNA sequence comparison [12,28,14,19,27] to whole-genome phylogeny construction [34,15,13,23,33] and the classification of protein sequences [14]. Most alignment-free approaches above mentioned require, each with its own specific approach and with the use of appropriate data structures, the computation of statistics of the sequences of the analyzed collections.…”

Section: Introduction (mentioning)

confidence: 99%

The Colored Longest Common Prefix Array Computed via Sequential Scans

Garofalo, Rosone, Sciortino et al. 2018

String Processing and Information Retrieval

Due to the increased availability of large datasets of biological sequences, the tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most of the alignment-free approaches require the computation of statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that makes it possible to tackle several problems efficiently with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, which consists of the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS-induced distances. Experimental results confirm the effectiveness of our approach.
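The multi-string ACS problem stated in this abstract (one query string against m strings, yielding m distances) can be sketched naively as follows. The sketch does not use the cLCP array or any external-memory technique. The distance formula is the commonly stated ACS-induced distance attributed to Ulitsky et al. and should be treated as an assumption here, as should the function names.

```python
import math

def acs(a, b):
    """Average length of the longest substring of b starting at each a[i]."""
    total = 0
    for i in range(len(a)):
        length = 0
        # Extend while the current prefix of a[i:] still occurs in b.
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        total += length
    return total / len(a)

def acs_distance(a, b):
    """Symmetrized ACS-induced distance (assumed form, see lead-in)."""
    def directed(x, y):
        avg = acs(x, y)
        if avg == 0:
            return math.inf  # no shared character at all
        return math.log(len(y)) / avg - 2 * math.log(len(x)) / len(x)
    return (directed(a, b) + directed(b, a)) / 2

def multi_acs_distances(query, collection):
    """One query against m strings, giving m ACS-induced distances."""
    return [acs_distance(query, s) for s in collection]
```

For example, `multi_acs_distances("ACGTACGT", ["ACGTACGT", "TTTTGGGG"])` returns one distance per collection member, with the identical string receiving the smaller value. Note that on very short strings the correction term can make the distance slightly negative.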
