Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches

Horwege, Sebastian; Lindner, Sebastian; Boden, Marcus; Hatje, Klas; Kollmar, Martin; Leimeister, Chris-André; Morgenstern, Burkhard

doi:10.1093/nar/gku398

Cited by 65 publications

(49 citation statements)

References 22 publications


“…For large-scale comparisons of genome-scale sequences, especially highly diverse ones, alignment-free methods of phylogeny construction have been increasingly used in the past few years [23][24][25][26]. There are two categories of alignment-free methods for phylogenomic analysis: one based on the statistics of word frequency, the other on Kolmogorov complexity and chaos theory [27].…”

mentioning

confidence: 99%

Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer

Zhang, Jun, Leuze, et al. 2017. Sci Rep


The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral “tree of life”. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. The resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.
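The k-mer features named in the abstract above can be illustrated with a minimal sketch. It assumes overlapping k-mers counted on a single strand and uses the textbook Shannon index H = -sum(p_i * ln p_i); the abstract does not give the paper's exact definitions, so the function names and these choices are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (substring of length k) in a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty range and an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_diversity(counts: Counter) -> float:
    """Textbook Shannon index over k-mer frequencies (natural log).

    Assumption: this is the standard definition, not necessarily the
    exact variant used in the cited study.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def common_kmers(seq_a: str, seq_b: str, k: int) -> int:
    """Number of distinct k-mers that occur in both sequences."""
    return len(set(kmer_counts(seq_a, k)) & set(kmer_counts(seq_b, k)))
```

Scanning a range of k values with functions like these shows the trade-off the abstract alludes to: small k gives features shared by nearly all genomes, while large k gives features shared by almost none.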


“…Although, The problem of time-shift can be solved based on FFT, frequency domain analysis can't solve the interrelationship on timing sequence accurately. Therefore, after steps of screening for j B , pairwise points should be checked in time domain so as to improve recognition accuracy based on high-level semantic [7]. In experiment, k is set to 3~8.…”

Section: Vectors Extracting and Screening

mentioning

confidence: 99%

An Approach for TV Channel Recognition based on Audio

Long 2015. Proceedings of the 2015 4th International Conference on Computer, Mechatronics, Control and Electronic Engineering


Abstract. In order to amalgamate distributing platform of traditional TV program with mobile Internet on technology and business, it is necessary to study the method of TV channel recognition based on the collaboration between mobile terminals and publishing service of TV program. Due to some advantages based on audio features, such as less processing data volume, lower complexity of signal variation, better realtime performance and non-directional sampling, some processing steps, such as data denoising, data standardization processing, sequence alignment, tolerance processing and pairwise point checking, etc., are studied for TV channel recognition based on audio features. The experimental results show that the overall performance of the proposed approach is better than those based on frequency domain and based on time domain simply because it has some advantages such as reduction of both sampling difference and sampling environment interference for various of handheld mobile terminals, reduction of transmitting interference for content distributing server-side, excellent overall performance on accuracy and efficiency, etc. The proposed approach can also be applied into data pushing, user interactive discussion, realtime vote, etc.


“…Some sequence matches are also missed due to insertion and deletions between key residue positions of a novel protein. In such cases, direct methods of functional annotation, which rely on scanning a sequence through sliding windows or use global summary of sequence properties such as amino acid composition have proved useful. Protein‐function annotation on a large scale is done by use of Gene Ontologies.…”

Section: Introduction

mentioning

confidence: 99%

“…In such cases, direct methods of functional annotation, which rely on scanning a sequence through sliding windows or use global summary of sequence properties such as amino acid composition have proved useful. [4][5][6][7][8] Protein-function annotation on a large scale is done by use of Gene Ontologies. [4] However, focusing on individual, well understood biological functions and annotating specific Biological functions gives much more power to a predictive method, as the annotations can incorporate knowledge specifically relevant for that system.…”

Section: Introduction

mentioning

confidence: 99%

Enabling full‐length evolutionary profiles based deep convolutional neural network for predicting DNA‐binding proteins from sequence

Chauhan, Ahmad 2019. Proteins


Sequence based DNA‐binding protein (DBP) prediction is a widely studied biological problem. Sliding windows on position specific substitution matrices (PSSMs) rows predict DNA‐binding residues well on known DBPs but the same models cannot be applied to unequally sized protein sequences. PSSM summaries representing column averages and their amino‐acid wise versions have been effectively used for the task, but it remains unclear if these features carry all the PSSM's predictive power, traditionally harnessed for binding site predictions. Here we evaluate if PSSMs scaled up to a fixed size by zero‐vector padding (pPSSM) could perform better than the summary based features on similar models. Using multilayer perceptron (MLP) and deep convolutional neural network (CNN), we found that (a) Summary features work well for single‐genome (human‐only) data but are outperformed by pPSSM for diverse PDB‐derived data sets, suggesting greater summary‐level redundancy in the former, (b) even when summary features work comparably well with pPSSM, a consensus on the two outperforms both of them (c) CNN models comprehensively outperform their corresponding MLP models and (d) actual predicted scores from different models depend on the choice of input feature sets used whereas overall performance levels are model‐dependent in which CNN leads the accuracy.
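The two input representations contrasted in the abstract above, a zero-padded PSSM (pPSSM) and a column-average summary, can be sketched as follows. The sketch assumes a PSSM stored as a list of 20-value rows (one row per residue) and padding appended at the end of the sequence; the abstract only says "zero-vector padding", so the padding position, the function names, and the target length are hypothetical choices for illustration.

```python
from typing import List

Matrix = List[List[float]]
N_COLUMNS = 20  # one PSSM column per standard amino acid


def pad_pssm(pssm: Matrix, target_len: int) -> Matrix:
    """Pad a PSSM with all-zero rows so every protein has target_len rows.

    Assumption: zero rows are appended after the last residue.
    """
    if len(pssm) > target_len:
        raise ValueError("protein is longer than the fixed target length")
    padding = [[0.0] * N_COLUMNS for _ in range(target_len - len(pssm))]
    return [list(row) for row in pssm] + padding


def column_means(pssm: Matrix) -> List[float]:
    """Summary feature: the average of each PSSM column over all residues."""
    if not pssm:
        return [0.0] * N_COLUMNS
    return [sum(row[j] for row in pssm) / len(pssm) for j in range(N_COLUMNS)]
```

The padded matrix keeps per-residue position information at the cost of a fixed, possibly large input size, whereas the column means always give 20 numbers regardless of protein length.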

