String kernels for protein sequence comparisons: improved fold recognition

Nojoomi, Saghi; Koehl, Patrice

doi:10.1186/s12859-017-1560-9

Cited by 4 publications

(13 citation statements)

References 61 publications

“…The weighted string kernel considered here, referred to as WSeqKernel, is inspired by the convolution string kernels introduced by D. Haussler [36], the local alignment kernel presented by Saigo et al. [27], and the string kernel of Smale and co-workers [28]. An unweighted version was presented in detail in Nojoomi and Koehl [29]. We provide here the key elements of its construction, emphasizing the differences from those kernels.…”

Section: Methods (mentioning)

confidence: 99%

“… is the sequence kernel considered in this paper. Following [28, 29, 36], we make the following remarks: The input kernel matrix G is not a traditional substitution matrix, as it does not involve applying the logarithm function to the probability measures. While the logarithm is needed to make scores additive, a necessary condition for using dynamic programming algorithms to generate pairwise sequence alignments, it is not needed for the string kernel we use here.…”

Section: Methods (mentioning)

confidence: 99%

“…One proposed solution to this limitation is the so-called spaced-seeds method, which defines patterns with match and possible don’t-care positions [18–21]. Another class of alignment-free methods for comparing protein sequences, directly relevant to this work, is the string-kernel-based methods [22–29].…”

Section: Introduction (mentioning)

confidence: 99%

“…In this paper we describe a new weighted string kernel that attempts to combine the benefits of the local string kernels [27, 28], which use a substitution matrix, and of the weighted degree kernels, which consider weighted sums of kernels obtained with fixed-length k-mers [25]. It is an extension of a preliminary study in which we introduced an unweighted kernel, SeqKernel, and showed its applications to protein fold recognition [29]. In this preliminary study, we showed that the kernel values computed by SeqKernel depend on sequence length, and that those dependencies can be minimized by changing the values of its parameters.…”

Section: Introduction (mentioning)

confidence: 99%
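
The idea quoted above, a weighted sum of kernels computed over fixed-length k-mers and scored with a similarity matrix that is not log-transformed, can be illustrated with a minimal Python sketch. This is not the authors' WSeqKernel implementation: the three-letter alphabet, the values in `G`, the weights, and all function names are hypothetical choices made only for illustration.

```python
import math

# Hypothetical toy similarity values on a 3-letter alphabet (not from the paper).
# The matrix is symmetric and positive, and no logarithm is applied.
G = {
    ("A", "A"): 1.0, ("A", "C"): 0.2, ("A", "D"): 0.1,
    ("C", "C"): 1.0, ("C", "D"): 0.3,
    ("D", "D"): 1.0,
}


def g(a, b):
    """Look up the similarity of two letters, in either order."""
    return G[(a, b)] if (a, b) in G else G[(b, a)]


def kmer_kernel(s, t, k):
    """Sum, over every pair of k-mers (one from s, one from t),
    of the product of letter similarities at matching positions."""
    total = 0.0
    for i in range(len(s) - k + 1):
        for j in range(len(t) - k + 1):
            prod = 1.0
            for p in range(k):
                prod *= g(s[i + p], t[j + p])
            total += prod
    return total


def weighted_kernel(s, t, weights):
    """Weighted sum of k-mer kernels; weights maps k to a weight (assumed scheme)."""
    return sum(w * kmer_kernel(s, t, k) for k, w in weights.items())


def normalized_kernel(s, t, weights):
    """Scale the kernel so that every sequence has similarity 1 with itself."""
    return weighted_kernel(s, t, weights) / math.sqrt(
        weighted_kernel(s, s, weights) * weighted_kernel(t, t, weights)
    )


if __name__ == "__main__":
    weights = {1: 1.0, 2: 0.5, 3: 0.25}  # illustrative weights only
    print(normalized_kernel("ACDA", "ACCA", weights))
```

Because the products are multiplied rather than added, no additive (logarithmic) score is required, which is the point made in the Methods statement about the input matrix G. The weights control how much long k-mers contribute relative to short ones; how they should be chosen to reduce length effects is described in the cited paper, not here.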

A weighted string kernel for protein fold recognition

Nojoomi, Koehl

2017

BMC Bioinformatics

Self Cite

Background: Alignment-free methods for comparing protein sequences have proved to be viable alternatives to approaches that first rely on an alignment of the sequences to be compared. Much work, however, needs to be done before those methods provide reliable fold recognition for proteins whose sequences share little similarity. We have recently proposed an alignment-free method based on the concept of string kernels, SeqKernel (Nojoomi and Koehl, BMC Bioinformatics, 2017, 18:137). In this previous study, we showed that while SeqKernel performs better than standard alignment-based methods, its applications are potentially limited because of biases due mostly to sequence length effects.

Methods: In this study, we propose improvements to SeqKernel that follow two directions. First, we developed a weighted version of the kernel, WSeqKernel. Second, we expand the concept of string kernels into a novel framework for deriving information on amino acids from protein sequences.

Results: Using a dataset that only contains remote homologs, we have shown that WSeqKernel performs remarkably well in fold recognition experiments. We have shown that with the appropriate weighting scheme, we can remove the length effects on the kernel values. WSeqKernel, just like any alignment-based sequence comparison method, depends on a substitution matrix. We have shown that this matrix can be optimized so that sequence similarity scores correlate well with structure similarity scores. Starting from no information on amino acid similarity, we have shown that we can derive a scoring matrix that echoes the physico-chemical properties of amino acids.

Conclusion: We have made progress in characterizing and parametrizing string kernels as alignment-free methods for comparing protein sequences, and we have shown that they provide a framework for extracting sequence information from structure.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1795-5) contains supplementary material, which is available to authorized users.

“…They are, however, computationally intensive to evaluate. With current databases of biological sequences on the order of hundreds of gigabytes, alternatives have been proposed both as faster, heuristic algorithms and as easier-to-compute similarity measures [32, 33, 3, 25, 11].…”

Section: Introduction (mentioning)

confidence: 99%

Algorithms to compute the Burrows-Wheeler Similarity Distribution

Louza, Telles, Gog, et al.

2019

Theoretical Computer Science

The Burrows-Wheeler transform (BWT) is a well-studied text transformation widely used in data compression and text indexing. The BWT of two strings can also provide similarity measures between them, based on the observation that the more their symbols are intermixed in the transformation, the more similar the strings are. In this article we present two new algorithms to compute similarity measures based on the BWT for string collections. In particular, we present practical and theoretical improvements to the computation of the Burrows-Wheeler similarity distribution for all pairs of strings in a collection. Our algorithms take advantage of the BWT computed for the concatenation of all strings, and use compressed data structures that reduce the running time while keeping a small memory footprint, as shown by a set of experiments with real and artificial datasets.
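
The intermixing observation in this abstract can be illustrated with a naive Python sketch. It is not one of the algorithms from the article: it sorts all suffixes of the concatenated strings directly (quadratic or worse), labels each sorted suffix by the string it starts in, and reports the distribution of run lengths of equal labels. The function names, the sentinel characters, and the use of the mean run length as a summary are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def origin_labels(s, t):
    """Sort the suffixes of s + sentinel + t + sentinel and label each one
    0 if it starts in s (or at its sentinel) and 1 otherwise.
    Assumes the characters '\\x00' and '\\x01' do not occur in s or t."""
    text = s + "\x01" + t + "\x00"
    order = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    return [0 if i <= len(s) else 1 for i in order]


def run_length_distribution(s, t):
    """Fraction of runs of each length in the label sequence.
    Short runs mean the two strings are heavily intermixed."""
    runs = [len(list(group)) for _, group in groupby(origin_labels(s, t))]
    counts = Counter(runs)
    return {k: c / len(runs) for k, c in sorted(counts.items())}


def mean_run_length(s, t):
    """Expected run length; values near 1 suggest similar strings."""
    dist = run_length_distribution(s, t)
    return sum(k * p for k, p in dist.items())


if __name__ == "__main__":
    print(run_length_distribution("banana", "bandana"))
    print(mean_run_length("aaa", "bbb"))
```

For identical inputs the labels tend to alternate, so the mean run length stays close to 1, whereas strings over disjoint alphabets produce long runs. A practical implementation would replace the naive sort with a proper suffix-array or BWT construction, which is where the compressed data structures mentioned in the abstract come in.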

Editorial

Matsudaira, Verma

2019

Progress in Biophysics and Molecular Biology
