Motivation: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. Results: We present a fast method, GenMap, to compute the (k, e)-mappability. We extend the mappability algorithm such that it can also be computed across multiple genomes, where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different output formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position. Availability and implementation: GenMap can be installed via bioconda. Binaries and C++ source code are available on https://github.com/cpockrandt/genmap.
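To make the definition of (k, e)-mappability concrete, here is a minimal brute-force sketch in Python. It is not GenMap's algorithm: it compares every k-mer against every other k-mer on the forward strand only, which takes quadratic time and is only practical for toy inputs. The function names `within_mismatches` and `naive_mappability` are hypothetical and chosen for illustration.

```python
def within_mismatches(a, b, e):
    """Return True if equal-length strings a and b differ in at most e positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > e:
                return False
    return True


def naive_mappability(genome, k, e):
    """Brute-force (k, e)-mappability: for each start position, the reciprocal
    of the number of k-mers in the genome within e mismatches of the k-mer
    starting there (the k-mer itself is included in the count)."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    values = []
    for query in kmers:
        occurrences = sum(1 for other in kmers if within_mismatches(query, other, e))
        values.append(1.0 / occurrences)
    return values


if __name__ == "__main__":
    # ACGT occurs at positions 0 and 4, so those positions get mappability 0.5.
    print(naive_mappability("ACGTACGT", 4, 0))
```

A value of 1.0 marks a k-mer that is unique under the chosen mismatch budget; smaller values mark repetitive positions.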
In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. There exists an algorithm to solve this problem for k = 1 requiring time O(mn log n / log log n) using space O(n). Here we present two new algorithms that require worst-case time O(mn) and O(n log n log log n), respectively, and space O(n), thus greatly improving the previous result. Moreover, we present another algorithm that requires average-case time and space O(n) for integer alphabets of size σ if m = Ω(log_σ n). Notably, we show that this algorithm is generalizable for arbitrary k, requiring average-case time O(kn) and space O(n) if m = Ω(k log_σ n), assuming that the letters are independent and uniformly distributed random variables. Finally, we provide an experimental evaluation of our average-case algorithm demonstrating its competitiveness to the state-of-the-art implementation.
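The k = 1 case of the problem statement can be illustrated with a simple hashing sketch in Python. This is not one of the algorithms from the abstract and does not achieve their bounds: building the masked keys costs roughly O(m^2 n) time. The idea is that two distinct factors are at Hamming distance exactly 1 if and only if they agree after masking out exactly one position, so counting factors per masked pattern and subtracting exact duplicates gives the answer. The name `one_mismatch_counts` is hypothetical.

```python
from collections import Counter


def one_mismatch_counts(x, m):
    """For each length-m factor of x, count the other length-m factors of x
    at Hamming distance at most 1 from it."""
    factors = [x[i:i + m] for i in range(len(x) - m + 1)]
    exact = Counter(factors)  # how often each factor occurs verbatim

    # masked[(p, prefix, suffix)] = number of factors matching everywhere except
    # possibly at position p.
    masked = Counter()
    for y in factors:
        for p in range(m):
            masked[(p, y[:p], y[p + 1:])] += 1

    counts = []
    for y in factors:
        total = exact[y] - 1  # other factors at distance 0
        for p in range(m):
            # Factors sharing the masked pattern but differing at position p
            # are at distance exactly 1.
            total += masked[(p, y[:p], y[p + 1:])] - exact[y]
        counts.append(total)
    return counts


if __name__ == "__main__":
    print(one_mismatch_counts("aaab", 2))
```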
In the Range Minimum Query (RMQ) problem, we are given an array A of n numbers and we are asked to answer queries of the following type: for indices i and j between 0 and n − 1, query RMQ_A(i, j) returns the index of a minimum element in the subarray A[i..j]. Answering a small batch of RMQs is a core computational task in many real-world applications, in particular due to the connection with the Lowest Common Ancestor (LCA) problem. By small batch, we mean that the number q of queries is o(n) and we have them all at hand. It is therefore not relevant to build an Ω(n)-sized data structure or spend Ω(n) time to build a more succinct one. It is well-known, among practitioners and elsewhere, that these data structures for online querying carry high constants in their pre-processing and querying time. We would thus like to answer this batch efficiently in practice. By efficiently in practice, we mean that we (ultimately) want to spend n + O(q) time and O(q) space. We write n to stress that the number of operations per entry of A should be a very small constant. Here we show how existing algorithms can be easily modified to satisfy these conditions. The presented experimental results highlight the practicality of this new scheme. The most significant improvement obtained is for answering a small batch of LCA queries. A library implementation of the presented algorithms is made available.
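One way to approach the batch setting, sketched below in Python, is to contract A to O(q) segments delimited by the query endpoints, scan each array entry at most once to find every segment's minimum, and then answer the queries on the contracted array. This is an illustrative sketch under my own assumptions, not the scheme from the abstract: it sorts the endpoints and builds a sparse table over the contracted array, so it takes one pass over A plus O(q log q) extra time and space rather than n + O(q) time and O(q) space. The name `batch_rmq` is hypothetical; queries are inclusive pairs (i, j) with 0 ≤ i ≤ j ≤ n − 1.

```python
from bisect import bisect_left


def batch_rmq(A, queries):
    """Answer all (i, j) queries offline; each answer is an index of a minimum
    of A[i..j] (inclusive)."""
    if not queries:
        return []
    # Cut points: every query [i, j] is a union of consecutive half-open
    # segments [cuts[t], cuts[t + 1]).
    cuts = sorted({i for i, _ in queries} | {j + 1 for _, j in queries})

    # One scan over the covered part of A: index of the minimum in each segment.
    seg_min = []
    for lo, hi in zip(cuts, cuts[1:]):
        best = lo
        for p in range(lo + 1, hi):
            if A[p] < A[best]:
                best = p
        seg_min.append(best)

    def better(p, q):
        return p if A[p] <= A[q] else q

    # Sparse table over the contracted array of segment minima.
    table = [seg_min]
    width = 1
    while 2 * width <= len(seg_min):
        prev = table[-1]
        table.append([better(prev[t], prev[t + width])
                      for t in range(len(prev) - width)])
        width *= 2

    answers = []
    for i, j in queries:
        first = bisect_left(cuts, i)          # first segment of the query
        last = bisect_left(cuts, j + 1) - 1   # last segment of the query
        level = (last - first + 1).bit_length() - 1
        answers.append(better(table[level][first],
                              table[level][last - (1 << level) + 1]))
    return answers


if __name__ == "__main__":
    print(batch_rmq([5, 2, 7, 2, 9, 1, 4], [(0, 2), (2, 4), (4, 6)]))
```

Entries of A outside the range spanned by the queries are never read, which matters when the batch touches only a small part of the array.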
We introduce the Longest Common Circular Factor (LCCF) problem in which, given strings S and T of length n, we are to compute the longest factor of S whose cyclic shift occurs as a factor of T. It is a new similarity measure, an extension of the classic Longest Common Factor. We show how to solve the LCCF problem in O(n log^5 n) time.
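The LCCF definition can be made concrete with a brute-force sketch in Python. It is far from the O(n log^5 n) bound stated above (it runs in polynomial time of much higher degree) and is meant only to pin down what is being computed. Two strings are cyclic shifts of each other exactly when their lexicographically smallest rotations coincide, so the sketch compares factors by that canonical form. The names `canonical_rotation` and `lccf_bruteforce` are hypothetical.

```python
def canonical_rotation(s):
    """Lexicographically smallest rotation of s; equal for all cyclic shifts of s."""
    if not s:
        return s
    return min(s[r:] + s[:r] for r in range(len(s)))


def lccf_bruteforce(S, T):
    """Return a longest factor of S that has a cyclic shift occurring as a
    factor of T, or the empty string if no letter is shared."""
    for length in range(min(len(S), len(T)), 0, -1):
        # Canonical forms of all length-`length` factors of T.
        t_classes = {canonical_rotation(T[j:j + length])
                     for j in range(len(T) - length + 1)}
        for i in range(len(S) - length + 1):
            factor = S[i:i + length]
            if canonical_rotation(factor) in t_classes:
                return factor
    return ""


if __name__ == "__main__":
    # "abc" is a factor of S, and its cyclic shift "cab" is a factor of T.
    print(lccf_bruteforce("xabcy", "zcabw"))
```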