Efficient minimizer orders for large values of k using minimum decycling sets

Pellow, David; Pu, Lianrong; Ekim, Barış; Kotlar, Lior; Berger, Bonnie; Shamir, Ron; Orenstein, Yaron

doi:10.1101/2022.10.18.512682

Cited by 4 publications

(4 citation statements)

References 44 publications


“…Through the application of “random minimizers”, which employ a hashing function, it is estimated that the number of selected minimizers required is twice the minimal theoretical number. Nevertheless, by adopting advanced minimizer selection algorithms [41, 31], one can surpass these expectations and further reduce the number of selected minimizers in practice.…”

Section: Methods (mentioning)

confidence: 99%
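The "twice the minimal theoretical number" estimate in the statement above can be checked empirically. The following is a minimal sketch, not code from any of the cited papers: it assumes a window of w consecutive k-mers, uses BLAKE2b as a stand-in for the unspecified hashing function, and compares the measured fraction of selected k-mers with the commonly quoted random-order approximation 2/(w+1) and the trivial lower bound 1/w. The function names, the parameters k = 15 and w = 10, and the random test sequence are all illustrative assumptions.

```python
import hashlib
import random


def kmer_hash(kmer: str) -> int:
    """Deterministic pseudo-random order on k-mers (stand-in for any hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def random_minimizers(seq: str, k: int, w: int) -> set:
    """Return start positions of the k-mers chosen as window minimizers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # Pick the leftmost k-mer with the smallest hash in this window.
        selected.add(start + window.index(min(window)))
    return selected


def density(seq: str, k: int, w: int) -> float:
    """Fraction of k-mer positions that are selected as minimizers."""
    return len(random_minimizers(seq, k, w)) / (len(seq) - k + 1)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    seq = "".join(rng.choice("ACGT") for _ in range(20000))
    k, w = 15, 10  # illustrative values, not taken from the cited papers
    print(f"measured density:           {density(seq, k, w):.3f}")
    print(f"random-order approximation: {2 / (w + 1):.3f}")
    print(f"trivial lower bound 1/w:    {1 / w:.3f}")
```

For large w the ratio between 2/(w+1) and 1/w approaches two, which is the factor the statement refers to; orders based on decycling sets aim to narrow that gap.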


K2R: Tinted de Bruijn Graphs implementation for efficient read extraction from sequencing datasets

Vandamme, Cazaux, Limasset

2024, Preprint


The study of biological sequences often relies on using reference genomes, yet achieving accurate assemblies remains challenging. Consequently, de novo analysis directly from raw reads, without pre-processing, is frequently more practical. We identify a very commonly shared need across various applications: identifying reads containing a specific kmer in a dataset. This kmer-to-reads association would be pivotal in multiple contexts, including genotyping, bacterial strain resolution, profiling, data compression, error correction, or assembly. While this challenge appears similar to the extensively researched colored de Bruijn graph problem, resolving it at the read level would be prohibitively resource-intensive in practical applications. In this work, we demonstrate its tractable resolution by leveraging certain assumptions for sequencing dataset indexing. To tackle this challenge, we introduce the Tinted de Bruijn Graph concept, an altered version of the colored de Bruijn graph where each read within a sequencing dataset represents a unique source. We developed K2R, a highly scalable index that implements such searches efficiently within this framework. K2R's performance, in terms of index size, memory footprint, throughput, and construction time, is benchmarked against leading methods, including hashing techniques (e.g., Short Read Connector) and full-text indexing (e.g., Spumoni and Movi), across various datasets. K2R consistently outperforms contemporary solutions in most metrics and is the only tool capable of scaling to larger datasets. To demonstrate K2R's scalability, we indexed two human datasets of the T2T consortium: the 126X coverage ONT dataset was indexed in 18 hours using 19 GB of RAM for a final index of 9.5 GB, and the 56X coverage HiFi dataset was indexed in 90 minutes using 5 GB of RAM for a final index of 207 MB. The K2R index, developed in C++, is open source and available on GitHub at github.com/LeaVandamme/K2R.
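To make the kmer-to-reads association described in this abstract concrete, here is a naive baseline sketch. It is not K2R's data structure: it simply stores every k-mer of every read in a hash map, which is exactly the kind of approach that becomes too large on real datasets. The function names and the toy reads are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import defaultdict


def build_kmer_to_reads(reads, k):
    """Map each k-mer to the set of read identifiers (list indices) containing it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    return index


def reads_containing(index, kmer):
    """Return the sorted identifiers of reads that contain the given k-mer."""
    return sorted(index.get(kmer, ()))


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "TTTTTT"]  # illustrative data only
    idx = build_kmer_to_reads(toy_reads, k=4)
    print(reads_containing(idx, "GTAC"))  # reads 0 and 1 share this k-mer
```

Each read acts as its own source in this toy index, which mirrors the "one color per read" idea; the memory cost of storing one entry per distinct k-mer is what a scalable index has to avoid.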



“…Recent advancements in minimizer selection techniques aim to closely approach this theoretical lower bound, thus reducing the quantity of necessary minimizers. We employ decycling set minimizers [31], which minimize the count of selected minimizers, albeit at the cost of increased computational overhead.…”

Section: Minimizer Scheme (mentioning)

confidence: 99%

K2R: Tinted de Bruijn Graphs implementation for efficient read extraction from sequencing datasets

Vandamme, Cazaux, Limasset

2024, Preprint


“…This union function is not entirely new. Although motivated by a different goal, it is suggested in [18] to use an order for minimizers which is based on the set φ u , where φ is the Mykkeltveit set.…”

Section: The Union Set and Sparse Canonicalization (mentioning)

confidence: 99%

“…The Mykkeltveit and Champarnaud sets are two known construction methods for decycling sets of minimum size. Although these sets are not used on their own as sketching methods, the Mykkeltveit set in particular has been used as a starting point to define sketching methods [17, 16, 5, 18]. By construction, these sets are decycling.…”

Section: Decycling In K-nonical Space (mentioning)

confidence: 99%

k-nonical space: sketching with reverse complements

Marçais, Elder, Kingsford

2024, Preprint


Sequences equivalent to their reverse complements (i.e., double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g., sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: the canonical representation (k-nonical space). The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate them. In particular, we show that large stretches of the genome ("sketching deserts") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate these effects: (1) a new procedure that adapts existing sketching methods to k-nonical space and (2) an optimization procedure to directly design new sketching methods for k-nonical space. The code used in this analysis is freely available at https://github.com/Kingsford-Group/mdsscope.

Keywords: sketching, reverse complement, canonical k-mer
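The folding step mentioned in this abstract can be sketched as follows. This is an illustration only, not code from the mdsscope repository: it assumes the common convention of taking the lexicographically smaller of a k-mer and its reverse complement as the canonical form, and it handles only the uppercase alphabet A, C, G, T. The function names are hypothetical.

```python
# Translation table for complementing DNA bases (uppercase ACGT only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Fold a k-mer and its reverse complement into one representative.

    Assumption: the representative is the lexicographically smaller string.
    """
    return min(kmer, reverse_complement(kmer))


if __name__ == "__main__":
    for kmer in ("GATTACA", "TGTAATC", "TTTT", "ACGT"):
        print(kmer, "->", canonical(kmer))
```

Both strands of the same locus map to the same canonical k-mer, which is why a sketching method defined on ordinary k-mers can behave differently once this folding is applied afterwards.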
