Flexible seed size enables ultra-fast and accurate read alignment

Sahlin, Kristoffer

doi:10.1101/2021.06.18.449070

Cited by 5 publications

(10 citation statements)

References 50 publications

(153 reference statements)

“…It has been shown that strobemers allow for much higher conservation (called match-coverage in Sahlin, 2021a) than k-mers. StrobeAlign (Sahlin, 2021b) is a new short-read aligner that combines syncmers and strobemers for extremely efficient alignment. Another example is the LCP (locally consistent parsing) technique (Hach et al., 2012; Sahinalp and Vishkin, 1996), which selects varying length substrings instead of k-mers in a locally consistent manner (i.e.…”

Section: Discussion (mentioning)

confidence: 99%

Theory of local k-mer selection with applications to long-read alignment

Shaw

2021

Bioinformatics

Motivation: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perform well.

Results: We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and demonstrate that there is up to an 8.2% relative increase in number of mapped reads. However, we found that the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment. We give new insight into how one might use new k-mer selection methods as a reparameterization to optimize for speed and alignment quality.

Availability and implementation: Simulations and supplementary methods are available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and available at https://github.com/bluenote-1577/os-minimap2.

Supplementary information: Supplementary data are available at Bioinformatics online.
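
The minimizer technique mentioned in the abstract above can be illustrated with a minimal Python sketch. It assumes plain lexicographic order as the k-mer order and breaks ties by taking the leftmost k-mer; the function name `minimizers` and the parameters `k` and `w` (k-mers per window) are illustrative choices, not taken from the cited paper or from minimap2.

```python
def minimizers(seq, k, w):
    """Select the lowest-ordered k-mer in every window of w consecutive k-mers.

    Assumption: k-mers are ordered lexicographically, ties go to the leftmost.
    Returns a sorted list of (position, k-mer) pairs without duplicates.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # Pick the position of the smallest k-mer in this window.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        selected.add(best)
    return [(i, kmers[i]) for i in sorted(selected)]


if __name__ == "__main__":
    for pos, kmer in minimizers("ACGTACGT", k=3, w=2):
        print(pos, kmer)
```

Because neighboring windows often share their smallest k-mer, the selected set is usually much smaller than the set of all k-mers, which is the source of the speed-up described above.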

“…Strobemers are constructed by linking together a set of smaller k-mers and can be constructed with several different methods to link the k-mers (minstrobes, randstrobes, hybridstrobes), yielding different properties. It was shown that Strobemers could offer higher sensitivity and specificity over k-mers, and they have been used for short-read mapping [38], long-read overlap detection [18], and transcriptomic long-read normalization [33].…”

Section: Other Seed Constructs (mentioning)

confidence: 99%
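
The idea of linking smaller k-mers into one seed, as described in the statement above, can be sketched as follows. This is a simplified illustration in the spirit of randstrobes with two strobes, not the published construction: the link function (CRC32 of the concatenated strobes) and the names `linked_seeds`, `w_min`, and `w_max` are assumptions made for this example.

```python
import zlib


def linked_seeds(seq, k, w_min, w_max):
    """Link each k-mer to one downstream k-mer chosen pseudo-randomly.

    Assumption: the second strobe is the k-mer starting in
    [i + w_min, i + w_max] that minimizes CRC32(first + candidate);
    ties go to the leftmost candidate.
    Returns (first_pos, second_pos, seed_string) tuples.
    """
    n_kmers = len(seq) - k + 1
    seeds = []
    for i in range(n_kmers):
        first = seq[i:i + k]
        lo = i + w_min
        hi = min(i + w_max, n_kmers - 1)
        if lo > hi:
            break  # no room left for a second strobe
        j = min(
            range(lo, hi + 1),
            key=lambda p: (zlib.crc32((first + seq[p:p + k]).encode()), p),
        )
        seeds.append((i, j, first + seq[j:j + k]))
    return seeds


if __name__ == "__main__":
    for seed in linked_seeds("ACGTTGCAAGCTTAGC", k=3, w_min=3, w_max=6):
        print(seed)
```

Since the gap between the two strobes varies from seed to seed, a seed can still match when an insertion or deletion falls between its strobes, which a contiguous k-mer of the same total length cannot do.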

“…The definition of E-hits was given in [38] and is a measure of how repetitive the seeds in a query sequence are, on average, in a reference dataset. More specifically, the E-hits computes the expected number of hits that seeds constructed from a query sequence obtained uniformly at random from the reference will have.…”

Section: E-hits Of Seeds (mentioning)

confidence: 99%
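
One way to read the E-hits description above is as the expected number of reference hits for a seed occurrence drawn uniformly at random from the reference. The sketch below computes that quantity under this simplifying assumption; the exact formula from [38] is not reproduced in the excerpt, and the function name `e_hits` is chosen for this example.

```python
from collections import Counter


def e_hits(reference_seeds):
    """Expected number of reference hits for a uniformly sampled seed occurrence.

    Assumption: every seed occurrence in the reference is equally likely to be
    sampled. A seed that occurs c times is sampled with probability c / N and
    then yields c hits, so the expectation is sum(c * c) / N.
    """
    counts = Counter(reference_seeds)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("reference_seeds must not be empty")
    return sum(c * c for c in counts.values()) / total


if __name__ == "__main__":
    print(e_hits(["ACG", "ACG", "CGT", "GTA"]))
```

Under this reading, a reference in which every seed is unique gives a value of 1, and the value grows as seeds become more repetitive.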

“…Nevertheless, it would, for example, be beneficial to understand what subsampling density is needed to make protocols similar in performance. Fourthly, since the minimap2 implementation is centered around minimizers, it is possible that aligners customized for, e.g., strobemers or other fuzzy seeds may enjoy an even more substantial performance gain, as shown for short-read alignment [38].…”

Section: Future Work (mentioning)

confidence: 99%

Entropy predicts sensitivity of pseudo-random seeds

Maier

Sahlin

2022

Preprint

Self Cite

In sequence similarity search applications such as read mapping, it is desired that seeds match between a read and reference in regions with mutations or read errors (sensitivity) but do not produce redundant matches due to repeats (specificity). K-mers are likely the most well-known and used seed construct in bioinformatics, and many studies on, e.g., spaced k-mers aim to improve sensitivity and specificity over k-mers. Recently, we developed a fuzzy seeding construct, strobemers, which were empirically demonstrated to have high sensitivity and specificity, but the study lacked a deeper understanding of why. In this study, we demonstrate that the entropy of a seed cover (a stretch of neighboring seeds) is a good predictor for seed sensitivity. We propose a model to estimate the entropy of a seed cover, and find that seed covers with high entropy typically have high match sensitivity. We also present two new strobemer seed constructs, mixedstrobes and altstrobes. We use both simulated and biological data to demonstrate that mixedstrobes and altstrobes improve sequence matching sensitivity over other strobemers. We implement strobemers into minimap2 and observe slightly faster alignment time and higher accuracy than using k-mers at various error rates. We believe the most important aspect of this work is our discovered seed stochasticity-sensitivity relationship. The relationship provides a clear explanation of why some fuzzy seeds perform better than others and a framework for designing even more sensitive seeds. In addition, we show that the two new seed constructs, mixedstrobes and altstrobes, are practically useful. Finally, in cases where our entropy model does not predict the observed sensitivity well, we explain why and how to improve the model in future work.

“…As the number and depth of high-throughput sequencing experiments grows, efficient methods to map, store, and search DNA sequences have become critical in their analysis. Sequence sketching is a fundamental building block of many of the basic sequence analysis tasks, such as assembly [20,4], alignment [22,19,11], and binning [2,1,6]. The common principle in all sketching techniques is the selection of a k-mer representative from a long DNA sequence for indexing sequences in data structures or algorithms.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient minimizer orders for large values of k using minimum decycling sets

Pellow

Ekim

et al. 2022

Preprint

Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long sub-sequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers overall than necessary and therefore provide limited improvement to runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders resulting in fewer selected k-mers. Unfortunately, generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus cannot help in the many applications that need minimizers of larger k. Here, we close this gap by introducing decycling set-based minimizer orders. We define new orders based on minimum decycling sets, which are guaranteed to hit any infinitely long sequence. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets, and can also scale up to larger k. Furthermore, we developed a query method that avoids the need to keep the k-mers of a decycling set in memory, which enables the use of these minimizer orders for any value of k. We expect the new decycling set-based minimizer orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
