Practical Universal k-mer Sets for Minimizer Schemes

Lecture Notes in Computer Science

Marçais

2020

Self Cite

Universal hitting sets are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between universal hitting sets and minimizers schemes, where minimizers schemes with low density (i.e., efficient schemes) correspond to universal hitting sets of small size. Local schemes are a generalization of minimizers schemes which can be used as a replacement for minimizers schemes with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a universal hitting set. We give bounds for the remaining path length of the Mykkeltveit universal hitting set. Additionally, we create a local scheme with the lowest known density that is only a log factor away from the theoretical lower bound.

“…a minimizers scheme with a reduced density leads to a smaller database and fewer locations to consider, hence an increased efficiency, while preserving the accuracy. There is a two-way correspondence between minimizers methods and universal hitting sets (UHSs): each minimizers method has a corresponding UHS, and a UHS defines a family of compatible minimizers methods [9,10]. The remaining path length $L$ of the UHS is upper-bounded by the number of bases in each window of the minimizers scheme ($L \le w + k - 1$). Moreover, the relative size of the UHS, defined as the size of the UHS over the number of possible k-mers, provides an upper bound on the density of the corresponding minimizers methods: the density is no more than the relative size of the universal hitting set. Precisely, $\frac{1}{w} \le d \le \frac{|U|}{\sigma^k}$, where $d$ is the density, $U$ is the universal hitting set, $\sigma^k$ is the total number of k-mers on an alphabet of size $\sigma$, and $w$ is the window length.
In other words, the study of universal hitting sets with small size leads to the creation of minimizers methods with provably low density. Local schemes [12] and forward schemes are generalizations of minimizers schemes. These extensions are of interest because they can be used in place of minimizers schemes while sampling k-mers with lower density. In particular, minimizers schemes cannot have density close to the theoretical lower bound of 1/w when w becomes large, while local and forward schemes do not suffer from this limitation [9]. Understanding how to design local and forward schemes with low density will allow us to further improve the computational efficiency of many bioinformatics algorithms. The previously known link between minimizers schemes and UHSs relied on the definition of an ordering between k-mers, and therefore is not valid for local and forward schemes that are not based on any ordering. Nevertheless, UHSs play a central role in understanding the density of local and forward schemes. Our first contribution is to describe the connection between UHSs, local and forward schemes. More precisely, there are two connections: first between the density of the schemes and the relative s...
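
The density and the lower bound 1/w discussed in this excerpt can be illustrated with a minimal Python sketch. It assumes a hash-based k-mer ordering and leftmost tie-breaking; the names `kmer_rank`, `minimizer_positions` and `density` are hypothetical helpers written for this illustration and do not come from the cited papers. The upper bound $|U|/\sigma^k$ is not checked here because it requires an explicit universal hitting set.

```python
import hashlib
import random


def kmer_rank(kmer: str) -> int:
    # Pseudo-random ordering on k-mers via hashing.
    # This ordering is an assumption made for illustration only.
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def minimizer_positions(seq: str, k: int, w: int) -> set:
    """Start positions of the k-mers selected by a minimizers scheme.

    Each window holds w consecutive k-mers (w + k - 1 bases); the k-mer
    with the smallest rank is selected, leftmost on ties.
    """
    n_kmers = len(seq) - k + 1
    ranks = [kmer_rank(seq[i:i + k]) for i in range(n_kmers)]
    selected = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        selected.add(min(window, key=lambda i: (ranks[i], i)))
    return selected


def density(seq: str, k: int, w: int) -> float:
    """Fraction of k-mer positions that are selected on this sequence."""
    return len(minimizer_positions(seq, k, w)) / (len(seq) - k + 1)


if __name__ == "__main__":
    rng = random.Random(0)
    k, w = 5, 10
    # 10004 bases give 10000 k-mers, a multiple of w.
    seq = "".join(rng.choices("ACGT", k=10004))
    print(f"density = {density(seq, k, w):.4f}, lower bound 1/w = {1 / w:.4f}")
```

Because every window of w consecutive k-mers contains at least one selected position, the measured density cannot fall below 1/w when the number of k-mers is a multiple of w, which is the case in the example above.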

Section: Introduction, mentioning

confidence: 99%

Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length

Lecture Notes in Computer Science

Marçais

2020

Self Cite

“…The idea of learning minimizer schemes tailored towards a target sequence has been previously explored, although to a lesser extent. Current approaches include heuristic designs [1, 8], greedy pruning [2] and construction of k-mer sets that are well-spread on the target sequence [20]. However, these methods only learn crude approximations of π by dividing k-mers into disjoint subsets with different priorities to be selected.…”

Section: Introduction, mentioning

confidence: 99%

DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes

Hoang

2022

Preprint

Self Cite

Minimizers are k-mer sampling schemes designed to generate sketches for large sequences that preserve sufficiently long matches between sequences. Despite their widespread application, learning an effective minimizer scheme with optimal sketch size is still an open question. Most work in this direction focuses on designing schemes that work well on expectation over random sequences, which have limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, require greedy approximations to solve an intractable discrete optimization problem on the permutation space of k-mer orderings. To address this challenge, we propose: (a) a reformulation of the combinatorial solution space using a deep neural network reparameterization; and (b) a fully differentiable approximation of the discrete objective. We demonstrate that our framework, DeepMinimizer, discovers minimizer schemes that significantly outperform state-of-the-art constructions on genomic sequences.
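
The property that minimizer sketches preserve sufficiently long matches can be checked with a small Python sketch. It assumes a hash-based k-mer ordering rather than a learned one, and the helpers `kmer_rank`, `sketch` and `random_dna` are hypothetical names introduced only for this illustration. Two sequences that share an exact substring of w + k − 1 bases share a full window, so the k-mer selected in that window appears in both sketches.

```python
import hashlib
import random


def kmer_rank(kmer: str) -> int:
    # Hash-based ordering: an illustrative assumption, not a learned scheme.
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def sketch(seq: str, k: int, w: int) -> set:
    """Set of k-mers selected as the minimum of some window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    ranks = [kmer_rank(km) for km in kmers]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (ranks[i], i))
        chosen.add(kmers[best])
    return chosen


def random_dna(rng: random.Random, n: int) -> str:
    return "".join(rng.choices("ACGT", k=n))


rng = random.Random(1)
k, w = 7, 12
# A shared match of exactly w + k - 1 bases, i.e. one full window.
shared = random_dna(rng, w + k - 1)
a = random_dna(rng, 200) + shared + random_dna(rng, 200)
b = random_dna(rng, 300) + shared + random_dna(rng, 100)

# k-mers sampled from both sequences; the minimizer of the shared window
# is guaranteed to be among them.
common = sketch(a, k, w) & sketch(b, k, w)
```

The guarantee depends only on both sequences using the same ordering, not on which ordering is used, which is why the ordering can be tuned for sketch size without losing the match-preservation property.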

“…The idea of constructing sequence sketches tailored to a specific sequence has been explored before (Chikhi et al., 2015; DeBlasio et al., 2019; Jain et al., 2020b), but it remains less understood than the average case. Random sequences have nice properties that allow for simplified probabilistic analysis.…”

Section: Introduction, mentioning

confidence: 99%

Sequence-specific minimizers via polar sets

Marçais

2021

Preprint

Self Cite

Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset.