Universal hitting sets are sets of words that are unavoidable: every sufficiently long sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between universal hitting sets and minimizers schemes: minimizers schemes with low density (i.e., efficient schemes) correspond to universal hitting sets of small size. Local schemes are a generalization of minimizers schemes that can be used as a replacement for minimizers schemes, with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a universal hitting set. We give bounds for the remaining path length of the Mykkeltveit universal hitting set. Additionally, we create a local scheme with the lowest known density, which is only a log factor away from the theoretical lower bound.
Lower density selection schemes, H. Zheng et al.

A minimizers scheme with a reduced density leads to a smaller database and fewer locations to consider, and hence to increased efficiency, while preserving accuracy.

There is a two-way correspondence between minimizers methods and universal hitting sets (UHSs): each minimizers method has a corresponding UHS, and a UHS defines a family of compatible minimizers methods [9,10]. The remaining path length L of the UHS is upper-bounded by the number of bases in each window of the minimizers scheme (L ≤ w + k − 1). Moreover, the relative size of the UHS, defined as the size of the UHS divided by the number of possible k-mers, provides an upper bound on the density of the corresponding minimizers methods: the density is no more than the relative size of the universal hitting set. Precisely, $1/w \le d \le |U|/\sigma^k$, where d is the density, U is the universal hitting set, $\sigma^k$ is the total number of k-mers on an alphabet of size σ, and w is the window length. In other words, the study of universal hitting sets of small size leads to the creation of minimizers methods with provably low density.

Local schemes [12] and forward schemes are generalizations of minimizers schemes. These extensions are of interest because they can be used in place of minimizers schemes while sampling k-mers with lower density. In particular, minimizers schemes cannot have density close to the theoretical lower bound of 1/w when w becomes large, while local and forward schemes do not suffer from this limitation [9]. Understanding how to design local and forward schemes with low density will allow us to further improve the computational efficiency of many bioinformatics algorithms.

The previously known link between minimizers schemes and UHSs relied on the definition of an ordering between k-mers, and is therefore not valid for local and forward schemes, which are not based on any ordering.
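As a concrete illustration of the density of a minimizers scheme and of the lower bound 1/w, the following minimal Python sketch selects k-mers with a lexicographic ordering and measures the fraction of selected positions on one random sequence. The function names `minimizer_positions` and `density`, the choice of lexicographic order, and leftmost tie-breaking are assumptions made for illustration only. The value computed is an empirical density on a single sequence, not the expected density analyzed in the text.

```python
import random


def minimizer_positions(seq, k, w):
    """Start positions of the k-mers selected by a lexicographic minimizers scheme.

    Each window contains w consecutive k-mers (w + k - 1 bases). In every
    window the smallest k-mer is selected; ties go to the leftmost occurrence.
    Illustrative helper, not taken from the paper.
    """
    selected = set()
    num_windows = len(seq) - (w + k - 1) + 1
    for start in range(num_windows):
        # Compare (k-mer, position) pairs so that ties are broken to the left.
        best = min(range(start, start + w), key=lambda i: (seq[i:i + k], i))
        selected.add(best)
    return sorted(selected)


def density(seq, k, w):
    """Fraction of k-mer positions in seq that are selected (empirical density)."""
    num_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w)) / num_kmers


if __name__ == "__main__":
    k, w = 5, 10
    random.seed(0)
    # 1000 k-mers in total, so that w divides the number of k-mers.
    seq = "".join(random.choice("ACGT") for _ in range(1000 + k - 1))
    d = density(seq, k, w)
    print(f"empirical density: {d:.3f}, lower bound 1/w: {1 / w:.3f}")
```

Because every window of w consecutive k-mers contains at least one selected position, the measured density in this setup cannot fall below 1/w. Checking the upper bound |U|/σ^k would additionally require enumerating the UHS corresponding to the chosen ordering, which this sketch does not attempt.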
Nevertheless, UHSs play a central role in understanding the density of local and forward schemes.

Our first contribution is to describe the connection between UHSs and local and forward schemes. More precisely, there are two connections: first between the density of the schemes and the relative s...