Entropy predicts sensitivity of pseudorandom seeds

Maier, Benjamin Dominik; Sahlin, Kristoffer

doi:10.1101/gr.277645.123

Cited by 2 publications

(11 citation statements)

References 60 publications


“…For the runtime, we evaluated randstrobes parametrized as (n = 2, l = 20, w_min = 21, w_max = 100) and (n = 2, l = 20, w_min = 21, w_max = 1000) since the window size affects runtime. Strobemers with n > 3 show no substantial gain in the context of sequence matching at the cost of additional runtime [12] (although they have been modified and used for specific scenarios [8]). Also, the relative performance can be extrapolated from the n = 2 and n = 3 cases, since the construction is recursive, therefore, we omit them in this study.…”

Section: Results (mentioning)

confidence: 99%

“…The methods to select strobes differ [18], and using alternating strobe lengths has also been explored [12]. However, randstrobes were shown to be more sensitive for sequence matching than other methods using fixed strobe lengths (minstrobes and hybridstrobes) [18], and simpler to construct than alternating strobe lengths (altstrobes and multistrobes) [12], and is so far most commonly implemented in practice, e.g., [20,15,23]. Therefore, we will consider only the randstrobes method in this study.…”

Section: Methods (mentioning)

confidence: 99%

“…In [12], we also found that the sensitivity of strobemers, measured as producing at least one seed match in a mutated region of fixed length, is strongly correlated with the pseudo-randomness of the seed construct (measured through entropy), where higher entropy yields higher sensitivity. In [12], we also introduced new strobemer variations, further improving sequence matching performance. Despite the introduction of these new variations, randstrobes remain the simplest and most used construct.…”

Section: Introduction (mentioning)

confidence: 96%
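
The statement above ties sensitivity to the entropy of the seed construct. The exact entropy definition used in [12] is not given in this excerpt, so the following is only a minimal sketch that assumes entropy is the Shannon entropy of the empirical distribution of offsets between the first and second strobe. The function name `shannon_entropy` and the two offset lists are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    # Sum of -p * log2(p) over every distinct observed value.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical strobe offsets (distance from the first to the second strobe).
uniform_offsets = [21, 22, 23, 24] * 25      # spread evenly over four positions
biased_offsets = [21] * 97 + [22, 23, 24]    # almost always the same position

print(shannon_entropy(uniform_offsets))
print(shannon_entropy(biased_offsets))
```

Under this assumed definition, a construct that picks the second strobe evenly across its window scores higher than one that nearly always picks the same position, which is the direction of the relationship the quoted statement describes.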

“…While there are applications that use other strobemer types [8], randstrobes have been most frequently used, e.g., for short-read mapping [20], transcriptomic long-read normalization [15], and read classification [23] in bioinformatic applications. Our recent proof-of-concept study also shows that randstrobes can provide accurate sequence similarity ranking through estimating the Jaccard distance [12].…”

Section: Introduction (mentioning)

confidence: 99%
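
The Jaccard distance mentioned above can be computed directly once each sequence has been reduced to a set of seeds. This is a generic sketch of the set-based definition, not the estimator used in [12]; the function name `jaccard_distance` and the toy seed sets are assumptions for illustration.

```python
def jaccard_distance(seeds_a, seeds_b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two seed collections."""
    a, b = set(seeds_a), set(seeds_b)
    if not a and not b:
        # Convention assumed here: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Toy seed sets; in practice these would be hashed seeds from two sequences.
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```

Ranking candidate sequences by this distance to a query orders them from most to least similar in terms of shared seeds.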

“…As randomness is important for sensitivity [12], we propose several new methods to perform the core operations in randstrobes (hashing, linking, and comparison) beyond previously published methods [18,20,23]. We also observe several types of bias (Fig.…”

Section: Introduction (mentioning)

confidence: 99%


Designing efficient randstrobes for sequence similarity analyses

Karami, Mohammadi, Martin et al. 2023

Preprint

Self Cite


Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis, reducing the search space by providing anchors between queries and references. However, k-mers are limited to exact matches between sequences. This has led to alternative constructs, such as spaced k-mers, that can match across substitutions. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in (Sahlin, 2021), have been incorporated into several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Randstrobes are constructed by linking together k-mers in a pseudo-random fashion and depend on a hash function, a link function, and a comparator for their construction. Recently, we showed that the more random this linking appears (measured in entropy), the more efficient the seeds are for sequence similarity analysis. The level of pseudo-randomness will depend on the hashing, linking, and comparison operators. However, no study has investigated the efficacy of the underlying operators to produce randstrobes. In this study, we propose several new construction methods. One of our proposed methods is based on a Binary Search Tree (BST), which lowers the time complexity and practical runtime relative to other methods for some parametrizations. To our knowledge, we are also the first to describe and study the types of biases that occur during construction. We designed three metrics to measure the bias. Using these new evaluation metrics, we uncovered biases and limitations in previous methods and showed that our proposed methods have favorable speed and sampling uniformity compared to previously proposed methods. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. Also, we suggest combining the two versions to improve accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.
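
The abstract names three operators behind randstrobe construction: a hash function, a link function, and a comparator. The sketch below shows how they could fit together for order-2 randstrobes. It is not one of the methods evaluated in the paper: the BLAKE2b-based `kmer_hash`, the XOR link, the `min` comparator, and the shrinking window at the sequence end are all assumptions made for illustration, and the function names are hypothetical. Parameters `l`, `w_min`, and `w_max` follow the notation of the citation statements above.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (assumed hash; any good hash works)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def randstrobes(seq: str, l: int, w_min: int, w_max: int):
    """Return (i, j) start positions of order-2 randstrobes over `seq`.

    For each first strobe at i, the second strobe is the position j in
    [i + w_min, i + w_max] whose linked value is smallest.
    """
    n_kmers = len(seq) - l + 1
    hashes = [kmer_hash(seq[i:i + l]) for i in range(n_kmers)]
    seeds = []
    for i in range(n_kmers):
        lo = i + w_min
        hi = min(i + w_max, n_kmers - 1)  # assumed: window shrinks at the end
        if lo > hi:
            break  # no room left for a second strobe
        # Link function: XOR of the two hashes. Comparator: minimum.
        j = min(range(lo, hi + 1), key=lambda p: hashes[i] ^ hashes[p])
        seeds.append((i, j))
    return seeds


seq = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGGATCCGATA"
for i, j in randstrobes(seq, l=3, w_min=4, w_max=8)[:5]:
    print(i, j, seq[i:i + 3], seq[j:j + 3])
```

Swapping the hash, the link expression, or the comparator changes how evenly `j` is distributed across the window, which is the kind of sampling bias the abstract says the authors measure.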


Designing efficient randstrobes for sequence similarity analyses

Karami, Soltani Mohammadi, Martin et al. 2024

Bioinformatics


Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, which has led to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), have been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds are for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated their efficacy.

Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign’s accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

Availability and implementation: All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.
