Sequence Comparison without Alignment: The SpaM approaches

Morgenstern, Burkhard

doi:10.1101/2019.12.16.878314

Cited by 4 publications

(6 citation statements)

References 86 publications


“…Here, a spaced-word match is a pair of words from two sequences that are identical at certain positions, specified by a pre-defined binary pattern of match and don't-care positions, see [39] for a short review of alignment-free approaches based on spaced-word matches.…”

Section: Data Availability Statement

mentioning

confidence: 99%

“…Skmer [37] is a further improvement of this approach. In a previous paper, we proposed another way to infer evolutionary distances between DNA sequences based on the number of word matches between them, and we generalized this to so-called spaced-word matches [38]. Here, a spaced-word match is a pair of words from two sequences that are identical at certain positions, specified by a pre-defined binary pattern of match and don't-care positions, see [39] for a short review of alignment-free approaches based on spaced-word matches. The distance function proposed in [38] is now used by default in the program Spaced [40]. Theoretically, this distance measure is based on a simple model of molecular evolution without insertions or deletions.…”

mentioning

confidence: 99%
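The definition of a spaced-word match quoted above can be illustrated with a minimal Python sketch. The function names and the pattern encoding ('1' for a match position, '0' for a don't-care position) are assumptions made for this illustration; they are not taken from the cited programs.

```python
from collections import defaultdict


def spaced_word(seq, start, pattern):
    """Return the characters of seq at the match positions ('1') of pattern,
    with the pattern placed at offset `start`."""
    return "".join(seq[start + off] for off, c in enumerate(pattern) if c == "1")


def spaced_word_matches(s1, s2, pattern):
    """List all (i, j) such that s1 at i and s2 at j agree at every match
    position of the pattern; don't-care positions are ignored."""
    span = len(pattern)
    # Index every spaced word of s2 by its match-position characters.
    index = defaultdict(list)
    for j in range(len(s2) - span + 1):
        index[spaced_word(s2, j, pattern)].append(j)
    # Look up each spaced word of s1 in that index.
    matches = []
    for i in range(len(s1) - span + 1):
        for j in index.get(spaced_word(s1, i, pattern), []):
            matches.append((i, j))
    return matches
```

For example, with the pattern "101" the sequences "ACGT" and "ATGT" share a spaced-word match at positions (0, 0): both windows start with A and have G at the third position, while the middle position is free to differ.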


The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances

et al. 2020

Self Cite


We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k_min) and F(k_max). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.

Citation: Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B (2020) The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS ONE 15(2): e0228070. https://doi.org/10.

Data Availability Statement: The source code of our software is freely available through GitHub

Traditionally, phylogenetic distances are inferred from pairwise or multiple sequence alignments. For the huge amounts of sequence data that are now available, however, sequence alignment has become too slow. Therefore, considerable efforts have been made in recent years to develop fast alignment-free approaches that can estimate phylogenetic distances without the need to calculate full alignments of the input sequences, see [3][4][5][6][7] for recent review articles. Alignment-free approaches are not only used in phylogeny reconstruction, but are also important in metagenomics [8][9][10], to find genome rearrangements [11] and in epidemiology [12] and other medical applications, for example to identify drug-resistant bacteria [13] or to classify viruses [14,15]. In all these applications, it is crucial to rapidly estimate pairwise similarity or dissimilarity values in large sets of sequence data.

Some alignment-free approaches are based on word frequencies [16,17] or on the length of common substrings [18][19][20]. Other methods use variants of the D2 distance, which is defined as the number of word matches of a pre-defined length between two sequences [15,21,22,23]; a review focusing on these methods is given in [24]. kWIP [25] is a further development of this concept that uses information-theoretical...
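The slope idea in the abstract above can be sketched in Python. The excerpt does not define F, so this sketch assumes F(k) = ln(N_k - L1 * L2 * q^k), where L1 and L2 are the sequence lengths and q = 1/4 is an assumed background match probability, and it reads the per-position match probability off the slope between two caller-supplied values k_min and k_max. The function names are hypothetical, and how k_min and k_max are chosen is not reproduced here. Only the final Jukes-Cantor correction is the standard textbook formula.

```python
import math
from collections import Counter


def count_kmer_matches(s1, s2, k):
    """N_k: number of position pairs (i, j) with s1[i:i+k] == s2[j:j+k]."""
    c1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    c2 = Counter(s2[j:j + k] for j in range(len(s2) - k + 1))
    return sum(n * c2[word] for word, n in c1.items())


def f_value(s1, s2, k, q=0.25):
    """ASSUMED form of F: log of N_k minus the expected number of
    background (chance) matches. Raises ValueError if that is not positive."""
    background = len(s1) * len(s2) * q ** k
    return math.log(count_kmer_matches(s1, s2, k) - background)


def match_probability_from_slope(s1, s2, k_min, k_max):
    """Under the assumption above, F has slope ln(p) in the linear range,
    where p is the per-position match probability."""
    slope = (f_value(s1, s2, k_max) - f_value(s1, s2, k_min)) / (k_max - k_min)
    return math.exp(slope)


def jukes_cantor_distance(p_match):
    """Standard Jukes-Cantor correction: substitutions per site from the
    fraction of matching positions."""
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * (1.0 - p_match))
```

A caller would combine the pieces as `jukes_cantor_distance(match_probability_from_slope(s1, s2, k_min, k_max))`. The estimate is only meaningful for sequences long enough that N_k clearly exceeds the background term at both values of k.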


“…Furthermore, for a substitution matrix assigning a score to any two symbols of the nucleotide alphabet A, we define the score of a spaced word match as the sum of all substitution scores of nucleotide pairs aligned to each other at the don't care positions of P. Spaced-word matches, called spaced seeds in this context, have been originally introduced in sequence-database searching [24]; later they were applied in alignment-free sequence comparison, to estimate phylogenetic distances between DNA and protein sequences [30,21,20,32], see [29] for a review.…”

Section: Definitions

mentioning

confidence: 99%
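The score defined in the statement above (the sum of substitution scores over the don't-care positions of the pattern P) can be sketched as follows. The scoring function `toy_score` (+1 for identical nucleotides, -1 otherwise) is a hypothetical stand-in chosen for this illustration; it is not the substitution matrix used by the cited work.

```python
def toy_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def spaced_word_match_score(s1, i, s2, j, pattern, score=toy_score):
    """Sum the substitution scores of the nucleotide pairs aligned at the
    don't-care positions ('0') of the pattern, for the spaced-word match
    that starts at position i in s1 and position j in s2."""
    return sum(
        score(s1[i + off], s2[j + off])
        for off, c in enumerate(pattern)
        if c == "0"
    )
```

With the pattern "10101", the windows "ACGTA" and "ACGCA" agree at all three match positions; their don't-care positions align C with C and T with C, giving a score of 1 - 1 = 0 under the toy scoring.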

Phylogenetic placement of short reads without sequence alignment

Morgenstern

Blanke

2020

Preprint

Self Cite


Phylogenetic placement is the task of placing a query sequence of unknown taxonomic origin into a given phylogenetic tree of a set of reference sequences. Several approaches to phylogenetic placement have been proposed in recent years. The most accurate of them need a multiple alignment of the reference sequences as input. Most of them also need alignments of the query sequences to the multiple alignment of the reference sequences. A major field of application of phylogenetic placement is taxonomic read assignment in metagenomics. Herein, we propose App-SpaM, an efficient alignment-free algorithm for phylogenetic placement of short sequencing reads on a tree of a set of reference genomes. App-SpaM is based on the Filtered Spaced Word Matches approach that we previously developed. Unlike other methods, our approach neither requires a multiple alignment of the reference genomes, nor alignments of the queries to the reference sequences. Moreover, App-SpaM works not only on assembled reference genomes, but can also take reference taxa as input for which only unassembled read sequences are available. The quality of the results achieved with App-SpaM is comparable to the best available approaches to phylogenetic placement. However, since App-SpaM is not based on sequence alignment, it is between one and two orders of magnitude faster than those existing methods.


“…These methods use different techniques, such as dynamic programming, pairwise comparison, and heuristic methods associated with similarity metrics between the nucleotide sequences. However, these methods have some limitations: (1) They require some prior knowledge of the reference sequence; (2) they admit that there is contiguity between homologous regions; (3) they are computationally expensive for long sequences (for example, for only two sequences of length N, we have (2N)!/(N!)^2 possible gapped sequences and a time complexity of the length of the inputs and/or their products); and (4) they are not very efficient for specimens that have a high rate of genomic mutation, such as viruses [16, 17, 18, 19, 20].…”

Section: Introduction

mentioning

confidence: 99%

Convolutional Neural Network Applied to SARS-CoV-2 Sequence Classification

Câmara

Coutinho

Silva

et al. 2022

Sensors


COVID-19, the illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a virus of the Coronaviridae family with a single-stranded positive-sense RNA genome, has been spreading around the world and has been declared a pandemic by the World Health Organization. On 17 January 2022, there were more than 329 million cases, with more than 5.5 million deaths. Although COVID-19 has a low mortality rate, its high capacities for contamination, spread, and mutation worry the authorities, especially after the emergence of the Omicron variant, which has a high transmission capacity and can more easily contaminate even vaccinated people. Such outbreaks require elucidation of the taxonomic classification and origin of the virus (SARS-CoV-2) from the genomic sequence for strategic planning, containment, and treatment of the disease. Thus, this work proposes a high-accuracy technique to classify viruses and other organisms from a genome sequence using a deep learning convolutional neural network (CNN). Unlike the other literature, the proposed approach does not limit the length of the genome sequence. The results show that the novel proposal accurately distinguishes SARS-CoV-2 from the sequences of other viruses. The results were obtained from 1557 instances of SARS-CoV-2 from the National Center for Biotechnology Information (NCBI) and 14,684 different viruses from the Virus-Host DB. As a CNN has several changeable parameters, the tests were performed with forty-eight different architectures; the best of these had an accuracy of 91.94 ± 2.62% in classifying viruses into their realms correctly, in addition to 100% accuracy in classifying SARS-CoV-2 into its respective realm, Riboviria. For the subsequent classifications (family, genus, and subgenus), this accuracy increased, which shows that the proposed architecture may be viable in the classification of the virus that causes COVID-19.
