Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

Orenstein, Yaron; Pellow, David; Marçais, Guillaume; Shamir, Ron; Kingsford, Carl

doi:10.1371/journal.pcbi.1005777

Cited by 56 publications

(69 citation statements)

References 19 publications

“…Furthermore, there are applications where the k-mer set is not related to sequence read data at all, e.g. a universal hitting set [26], a chromosome-specific reference dictionary [27], or a winnowed min-hash sketch (for example as in [28], or see [29,30] for a survey).…”

Section: Related Work (mentioning)

confidence: 99%

Representation of k-mer sets using spectrum-preserving string sets

Rahman

Medvedev

2020

Preprint

Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which we show improves index size by 10-44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
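The idea of a spectrum-preserving representation described in this abstract can be illustrated with a minimal sketch. It is not the UST algorithm; it only checks that a set of strings contains exactly the k-mers of a given set and compares the total character counts. The function names are hypothetical, and the sketch assumes a simplified setting that ignores reverse complements and does not check for repeated k-mers.

```python
def kmer_spectrum(strings, k):
    """Return the set of all k-mers occurring in any of the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def preserves_spectrum(strings, kmers, k):
    """True if the strings contain exactly the given k-mers (simplified check)."""
    return kmer_spectrum(strings, k) == set(kmers)


def total_characters(strings):
    """Size of a representation, measured as the total number of characters."""
    return sum(len(s) for s in strings)


# Toy example (assumed data): four 3-mers stored as two longer strings.
kmers = {"ACG", "CGT", "GTA", "TTT"}
representation = ["ACGTA", "TTT"]

print(preserves_spectrum(representation, kmers, 3))
print(total_characters(kmers), total_characters(representation))
```

In this toy case the overlapping k-mers ACG, CGT and GTA collapse into the single string ACGTA, which is where the space saving comes from.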

“…Universal k-mer sets are central to the construction of orders with low density [11]. In fact, the proposed method, just like DOCKS [16,15], is a heuristics to construct universal sets.…”

Section: Universal Sets and Compatible Orders (mentioning)

confidence: 99%

“…The problem of finding an optimal order, i.e., an order with the lowest possible density, is still open [13]. Orenstein et al [16] proposed a heuristic, DOCKS, that is used to create orders with low density. Unfortunately, this method has a compute time that is over exponential in k and is impractical for k ≥ 10.…”

Section: Introduction (mentioning)

confidence: 99%

Practical universal k-mer sets for minimizer schemes

DeBlasio

Gbosibo

Kingsford

et al. 2019

Preprint

Self Cite

Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same. One way to find k-mer orderings for minimizer schemes is through the use of universal k-mer sets, which are subsets of k-mers that are guaranteed to cover all windows. The smaller this set, the fewer false positives (where two poorly aligned sequences are identified as possible matches) are identified. Current methods for creating universal k-mer sets are limited in the length of the k-mer that can be considered, and cannot compute sets in the range of lengths currently used in practice. We take some of the first steps in creating universal k-mer sets that can be used to construct minimizer orders for large values of k that are practical. We do this using iterative extension of the k-mers in a set, and guided contraction of the set itself. We also show that this process will be guaranteed to never increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives over using the current sets on small k-mers.

* Work performed as part of the Internship in Biomedical Research, Informatics, and Computer Science (iBRIC) at the University of Pittsburgh. † Corresponding Author.

[…] that two reads sharing a significant amount of sequence must be binned together, therefore avoiding false negatives (missed overlaps).

It is generally beneficial to select as few k-mers as possible from the sequence. For example, in the case of overlap computation, this leads to smaller bins and less computation, and in the case of sparse data structures, fewer selected k-mers imply a sparser data structure. The density is the measure of the number of selected k-mers over the length of the sequence (see Section 2.1) and a lower density is desirable.

The minimizers method is rather a family of methods parameterized by the length k of the k-mers, the length w of the windows, and the order imposed on the k-mers to select the smallest k-mer in each window. Generally the parameters k and w are constrained by the application itself. By contrast, the order on the k-mers is a "free" parameter: regardless of the choice of the order, the two properties above are satisfied, and the algorithm is correct for any order.

Although any choice of order leads to a correct result, the order has a significant influence on the expected density of selected k-mers. Therefore, the choice of an order with lower density leads to better performance for applications using minimizers. Finding orders with low density will improve future applications and, because any order satisfies properties (1) and (2) above, these improved orders could be retrofitted into existing applications.
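The minimizers family described above (parameters k, w and an order on k-mers) and the resulting density can be sketched as follows. This is a minimal illustration rather than code from the cited work: the function names are hypothetical, a window is assumed to be w consecutive k-mers, ties are assumed to be broken by leftmost position, and the default order is assumed to be lexicographic. Density follows the excerpt's wording, selected k-mers over sequence length.

```python
def minimizer_positions(seq, k, w, order=None):
    """Return sorted start positions of the k-mers selected as minimizers.

    `order` maps a k-mer to a sortable key; None means lexicographic order.
    """
    key = order if order is not None else (lambda kmer: kmer)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest k-mer under the order; ties go to the leftmost position.
        best = min(window, key=lambda i: (key(kmers[i]), i))
        selected.add(best)
    return sorted(selected)


def density(seq, k, w, order=None):
    """Number of selected k-mers divided by the length of the sequence."""
    return len(minimizer_positions(seq, k, w, order)) / len(seq)


# Toy example (assumed data), lexicographic order.
print(minimizer_positions("ACGTACGTAC", k=3, w=2))
print(density("ACGTACGTAC", k=3, w=2))
```

Passing a different `order` function changes which k-mers are selected, and therefore the density, without affecting the guarantee that every window contributes a selected k-mer.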

“…The first one computes a set of k-mers that covers every path of length w in the de Bruijn graph (an extension of the set cover problem). This problem was studied in Orenstein et al (2017) and Marçais et al (2017), and this new algorithm gives an asymptotically optimal solution. The second algorithm gives the order between k-mers for the minimizers schemes in (I).…”

Section: Introduction (mentioning)

confidence: 99%
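The covering property mentioned in the citation statement above, a set of k-mers that hits every window, can be checked by brute force for very small parameters. The sketch below is an illustration only and is not the algorithm of any cited paper: the function name is hypothetical, a window is assumed to be w consecutive k-mers (a string of length w + k - 1), and the enumeration is exponential in that length, so it is only usable for toy sizes.

```python
from itertools import product


def is_universal(kmer_set, k, w, alphabet="ACGT"):
    """True if every string of length w + k - 1 contains a k-mer from kmer_set."""
    length = w + k - 1
    for chars in product(alphabet, repeat=length):
        s = "".join(chars)
        # The string holds exactly w k-mers, starting at positions 0..w-1.
        if not any(s[i:i + k] in kmer_set for i in range(w)):
            return False
    return True


# Toy example over a two-letter alphabet (assumed data), k = 2, w = 2.
print(is_universal({"AA", "CC", "AC"}, k=2, w=2, alphabet="AC"))
print(is_universal({"AA", "CC"}, k=2, w=2, alphabet="AC"))
```

In the second call the string ACA contains only AC and CA, so the set {AA, CC} misses that window and is not universal.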

Asymptotically optimal minimizers schemes

Marçais

DeBlasio

Kingsford

2018

Preprint

Self Cite

Motivation: The minimizers technique is a method to sample k-mers that is used in many bioinformatics software to reduce computation, memory usage and run time. The number of applications using minimizers keeps on growing steadily. Despite its many uses, the theoretical understanding of minimizers is still very limited. In many applications, selecting as few k-mers as possible (i.e. having a low density) is beneficial. The density is highly dependent on the choice of the order on the k-mers. Different applications use different orders, but none of these orders are optimal. A better understanding of minimizers schemes, and the related local and forward schemes, will allow designing schemes with lower density, and thereby making existing and future bioinformatics tools even more efficient.

Results: From the analysis of the asymptotic behavior of minimizers, forward and local schemes, we show that the previously believed lower bound on minimizers schemes does not hold, and that schemes with density lower than thought possible actually exist. The proof is constructive and leads to an efficient algorithm to compare k-mers. These orders are the first known orders that are asymptotically optimal. Additionally, we give improved bounds on the density achievable by the three types of schemes.
