2021
DOI: 10.1093/bioinformatics/btab790
Theory of local k-mer selection with applications to long-read alignment

Abstract: Motivation: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perf…

Cited by 32 publications (34 citation statements)
References 37 publications
“…These new k-mers are called seeds instead of markers because we actually use them as seeds for k-mer matching and alignment. We note that while we could have used other "context-independent" k-mer seeding methods that are more "conserved" than FracMinHash [28], we found that FracMinHash works well enough for relatively sparse seeds when c ≫ k. By default, k = 15 and c = 125. We note that the small value of k = 15 used by default leads to too many repetitive anchors on larger genomes, so we mask the top seeds that occur more than 2500/c times by default.…”
Section: Obtaining Sparse Seeds For Chaining
confidence: 88%
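The FracMinHash seeding described in the quote above can be sketched in a few lines: a k-mer is kept iff its hash falls in the lowest 1/c fraction of the hash range, so the decision is "context-independent" (it depends only on the k-mer, not on neighboring k-mers) and the expected density is roughly 1/c. This is an illustrative sketch, not the cited tool's implementation; the hash function and the `fracminhash_seeds` name are choices made here for demonstration.

```python
import hashlib


def fracminhash_seeds(seq, k=15, c=125):
    # FracMinHash selection: keep a k-mer iff its 64-bit hash falls in
    # the lowest 1/c fraction of the hash range. The decision depends
    # only on the k-mer itself, so identical k-mers are always either
    # all kept or all dropped (context independence). Defaults mirror
    # the quote's k = 15, c = 125; the quoted method additionally masks
    # seeds occurring more than 2500/c times, which is omitted here.
    threshold = (1 << 64) // c
    seeds = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        h = int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big"
        )
        if h < threshold:
            seeds.append((i, kmer))
    return seeds
```

Because selection is per-k-mer, two sequences sharing a k-mer always agree on whether it is a seed, which is what makes these seeds usable as anchors for matching and chaining.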
“…These new k-mers are called seeds instead of markers because we actually use them as seeds for k-mer matching and alignment. We note that while we could have used other “context-independent” k-mer seeding methods that are more “conserved” than FracMinHash [28], we found that FracMinHash works well enough for relatively sparse seeds when c ≫ k . By default, k = 15 and c = 125.…”
Section: Methods
confidence: 99%
“…The original open syncmer definition in (Edgar, 2021) had a parameter t where a k-mer was selected if the smallest s-mer was in the t-th position; we proved in (Shaw and Yu, 2022) that the optimal t is ⌈(k−s+1)/2⌉ with respect to maximizing the number of conserved bases from k-mer matching. The reason we choose open syncmers is primarily due to the following fact, which was shown in (Edgar, 2021): Theorem 7 follows by examining the smallest s-mer in a k-mer and noticing that in the next overlapping k-mer, the locations for the new smallest s-mer are restricted.…”
Section: Sketching and Local K-mer Selection
confidence: 91%
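The open-syncmer rule quoted above admits a short illustration: a k-mer is an open syncmer iff its smallest s-mer starts at offset t, and exactly one offset "wins" per k-mer, which is why the density is 1/(k−s+1). This sketch uses lexicographic order as a stand-in for the random s-mer ordering; with 1-based positions the optimal t is ⌈(k−s+1)/2⌉, i.e. ⌈(k−s+1)/2⌉ − 1 in the 0-based indexing below.

```python
def is_open_syncmer(kmer, s, t):
    # Open syncmer test: the k-mer is selected iff its smallest s-mer
    # (lexicographic order here, standing in for a random hash order;
    # ties broken by leftmost position) starts at 0-based offset t.
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers)) == t
```

Since each k-mer has exactly one winning offset among its k−s+1 s-mer positions, summing the indicator over all t gives 1, so on random sequences the fraction of selected k-mers is 1/(k−s+1), matching the quoted identity c = k−s+1.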
“…) which we will ignore; see (Zheng et al, 2020) or (Shaw and Yu, 2022)). We will let c be the reciprocal of the density, so c = (k − s + 1).…”
Section: Sketching and Local K-mer Selection
confidence: 99%
“…The main way of analyzing genetic sequences is by comparing them to each other. For large data, this is usually done via ‘seeds’, by which we mean simple similarities that can be found quickly ( Shaw and Yu, 2022 ). The simplest seeds are fixed-length exact matches, but they can also be inexact ( Altschul et al , 1990 ; Ma et al , 2002 ; Noé and Kucherov, 2004 ; Sahlin, 2021 ) and/or variable length ( Csűrös, 2004 ).…”
Section: Introduction
confidence: 99%