Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use a randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same. One way to find k-mer orderings for minimizer schemes is through the use of universal k-mer sets, which are subsets of k-mers that are guaranteed to cover all windows. The smaller this set, the fewer false positives (where two poorly aligned sequences are labeled as possible matches) are identified. Current methods for creating universal k-mer sets are limited in the length of the k-mer that can be considered, and cannot compute sets in the range of lengths currently used in practice. We take some of the first steps in creating universal k-mer sets that can be used to construct minimizer orders for large, practical values of k. We do this using iterative extension of the k-mers in a set, and guided contraction of the set itself. We also show that this process is guaranteed never to increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives compared with using the current sets on small k-mers.

CCS CONCEPTS • Theory of computation → Sketching and sampling; • Applied computing → Computational genomics; Bioinformatics;
* Work performed as part of the Internship in Biomedical Research, Informatics, and Computer Science (iBRIC) at the University of Pittsburgh
† Corresponding Author

that two reads sharing a significant amount of sequence must be binned together, therefore avoiding false negatives (missed overlaps). It is generally beneficial to select as few k-mers as possible from the sequence. For example, in the case of overlap computation, this leads to smaller bins and less computation, and in the case of sparse data structures, fewer selected k-mers imply a sparser data structure.
The density is the number of selected k-mers divided by the length of the sequence (see Section 2.1), and a lower density is desirable.

The minimizers method is in fact a family of methods parameterized by the length k of the k-mers, the length w of the windows, and the order imposed on the k-mers, which determines the smallest k-mer in each window. Generally, the parameters k and w are constrained by the application itself. By contrast, the order on the k-mers is a "free" parameter: regardless of the choice of order, the two properties above are satisfied, and the algorithm is correct for any order.

Although any choice of order leads to a correct result, the order has a significant influence on the expected density of selected k-mers. Therefore, choosing an order with lower density leads to better performance for applications using minimizers. Finding orders with low density will improve future applications and, because any order satisfies properties (1) and (2) above, these improved orders could be retrofitted into existing applications.
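To make the roles of k, w, and the order concrete, the following is a minimal sketch of minimizer selection and density, not the implementation used in this work. It assumes that a window consists of w consecutive k-mers, that ties are broken by the leftmost position, and that density is the number of selected positions divided by the sequence length. The function names (`minimizer_positions`, `density`, `random_order`) are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, Dict, Hashable, Set


def minimizer_positions(
    seq: str, k: int, w: int, key: Callable[[str], Hashable]
) -> Set[int]:
    """Start positions of k-mers chosen as the minimizer of at least one window.

    Assumption: a window is w consecutive k-mers; ties go to the leftmost k-mer.
    `key` maps a k-mer to its rank under the chosen order (smaller is preferred).
    """
    n_kmers = len(seq) - k + 1
    selected: Set[int] = set()
    for start in range(n_kmers - w + 1):
        # Pick the position whose k-mer is smallest under the order.
        best = min(
            range(start, start + w),
            key=lambda i: (key(seq[i:i + k]), i),
        )
        selected.add(best)
    return selected


def density(seq: str, k: int, w: int, key: Callable[[str], Hashable]) -> float:
    """Number of selected k-mers divided by the length of the sequence."""
    return len(minimizer_positions(seq, k, w, key)) / len(seq)


def random_order(seed: int) -> Callable[[str], float]:
    """A reproducible pseudo-random order: each distinct k-mer gets a random rank."""
    rng = random.Random(seed)
    ranks: Dict[str, float] = {}

    def key(kmer: str) -> float:
        if kmer not in ranks:
            ranks[kmer] = rng.random()
        return ranks[kmer]

    return key


if __name__ == "__main__":
    # Lexicographic order is simply the identity key on strings.
    print(sorted(minimizer_positions("ACGTACGT", 2, 3, key=lambda kmer: kmer)))
    # [0, 1, 4]

    # Compare two orders on the same random sequence; k and w stay fixed.
    rng = random.Random(0)
    seq = "".join(rng.choice("ACGT") for _ in range(10_000))
    print("lexicographic:", density(seq, 5, 10, key=lambda kmer: kmer))
    print("random order: ", density(seq, 5, 10, key=random_order(1)))
```

Only the `key` argument changes between the two density calls, which mirrors the point that the order is the free parameter: every order yields a valid selection, but the resulting densities can differ.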