GaKCo: A Fast Gapped k-mer String Kernel Using Counting

Singh, Ritambhara; Sekhon, Arshdeep; Kowsari, Kamran; Lanchantin, Jack; Wang, Beilun; Qi, Yanjun

doi:10.1007/978-3-319-71249-9_22

Cited by 24 publications

(14 citation statements)

References 19 publications


“…Baselines. We compare the prediction accuracy and efficiency of FastSK with 3 state-of-the-art string kernel baselines. For DNA and protein data, we baseline against gkmSVM-2.0 [8] and GaKCo [27]. For an NLP string kernel baseline, we use the Blended Spectrum Kernel [12,11], as it has recently achieved strong results in natural language processing.…”

Section: Experimental Setup and Results (mentioning)

confidence: 99%

“…FastSK directly counts the gapped k-mers shared between sequences, previous works (e.g. [7,8,18,19,27]) indirectly compute the kernel function by inferring the counts from a set of mismatch statistics. These methods take inspiration from [17], which uses the notion of a mismatch neighborhood to efficiently compute the (k, m)-mismatch kernel.…”

Section: Connecting To Related Work Mismatch Statistic-based String K (mentioning)

confidence: 99%

“…Counting Implementations. GaKCo [27] is similar to FastSK in that it uses a sort-and-count algorithm. However, it differs from FastSK in that it follows the mismatch statistic formulation from [7].…”

Section: Connecting To Related Work Mismatch Statistic-based String K (mentioning)

confidence: 99%

“…String kernels in conjunction with Support Vector Machines (SK-SVM) achieve strong prediction performance across a variety of sequence analysis tasks, with widespread use in bioinformatics and natural language processing (NLP). SK-SVMs are a popular technique for DNA regulatory element identification [22,10,25,7,19,27,8], and bio-medical named entity recognition [23,15]. SK-SVMs are also popular baselines for evaluating the quality of deep learning models [3,10,25] for analyzing variant impacts.…”

Section: Introduction (mentioning)

confidence: 99%

“…Therefore, the feature vectors are both extremely large and extremely sparse, which makes models trained on these feature vectors highly prone to overfitting and poor generalization. Third, existing algorithms that overcome these challenges leave much to be desired; for example, popular trie-based approaches (e.g., [7,4]) still exhibit exponential dependence on |Σ| [20,27], k, and m. On the other hand, counting-based methods rely on complex "mismatch statistics" to indirectly obtain feature counts [17,6,27], however still fail to scale to greater feature lengths. Together, these issues present major limitations to the practical utility of k-mer string kernel methods.…”

Section: Introduction (mentioning)

confidence: 99%


FastSK: Fast Sequence Analysis with Gapped String Kernels

Blakely, Collins, Singh, et al. 2020

Preprint

Self Cite


Abstract: Gapped k-mer kernels with Support Vector Machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly-sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature-length, number of mismatch positions, and the task’s alphabet size. In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On 10 DNA transcription factor binding site (TFBS) prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in AUC, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks across all 10 TFBS tasks. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Our algorithm is available as a Python package and as C++ source code.
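The idea of decomposing a gapped k-mer kernel into independent counting operations, one per choice of mismatch positions, can be illustrated with a small sketch. This is not the FastSK or GaKCo implementation: the function name `gapped_kmer_kernel` and the parameters `g` (window length) and `m` (number of masked positions, so k = g - m characters are kept) are assumptions made for illustration, and the sketch enumerates every position set exactly rather than sampling them as a Monte Carlo approximation would.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_kernel(x, y, g, m):
    """Illustrative gapped k-mer kernel value for sequences x and y.

    Hypothetical parameters: g is the window length and m the number of
    masked (ignored) positions inside each window, so k = g - m
    characters are compared.
    """
    total = 0
    # Each set of masked positions is handled independently.
    for masked in combinations(range(g), m):
        keep = [i for i in range(g) if i not in masked]
        # Count the k-mers left after dropping the masked positions.
        cx = Counter(
            tuple(x[s + i] for i in keep) for s in range(len(x) - g + 1)
        )
        cy = Counter(
            tuple(y[s + i] for i in keep) for s in range(len(y) - g + 1)
        )
        # Shared features contribute the product of their counts.
        total += sum(c * cy[f] for f, c in cx.items())
    return total


if __name__ == "__main__":
    print(gapped_kmer_kernel("ACGT", "ACGT", g=3, m=1))
    print(gapped_kmer_kernel("ACGT", "ACCT", g=3, m=1))
```

Because each position set contributes an independent term to the sum, a sampled subset of position sets could stand in for the full enumeration, which is the kind of approximation the abstract alludes to.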


The unreasonable effectiveness of machine learning in Moldavian versus Romanian dialect identification

Găman, Ionescu 2021

Int J of Intelligent Sys


Motivated by the seemingly high accuracy levels of machine learning (ML) models in Moldavian versus Romanian dialect identification and the increasing research interest on this topic, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 evaluation campaign. The shared task included two subtask types: one that consisted in discriminating between the Moldavian and Romanian dialects and one that consisted in classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores, for example, the top model for Moldavian versus Romanian dialect identification obtained a macro-F 1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates compared with ML models. Hence, it remains unclear why the methods proposed by participants attain such high accuracy rates. Our goal is to understand (i) why the proposed methods work so well (by visualizing the discriminative features) and (ii) to what extent these methods can keep their high accuracy levels, for example, when we shorten the text samples to single sentences or when we use tweets at inference time. A secondary goal of our
