Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low costs can be assembled for reconstructing the whole genomes. Unfortunately, the large memory footprint of the existing de novo assembly algorithms makes it challenging to get the assembly done for higher eukaryotes like mammals. In this work, we investigate the memory issue of constructing de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundreds of gigabytes memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually and later merged with others to form a de Bruijn graph. By leveraging the overlaps among the k-mers (substring of length k), MSP achieves astonishing compression ratio: The total size of partitions is reduced from Θ(kn) to Θ(n), where n is the size of the short read database, and k is the length of a k-mer. Experimental results show that our method can build de Bruijn graphs using a commodity computer for any large-volume sequence dataset.
Abstract. Given a (master) set M of n points in d-dimensional Euclidean space, consider drawing a random subset that includes each point mi ∈ M with an independent probability pi. How difficult is it to compute elementary statistics about the closest pair of points in such a subset? For instance, what is the probability that the distance between the closest pair of points in the random subset is no more than , for a given value ? Or, can we preprocess the master set M such that given a query point q, we can efficiently estimate the expected distance from q to its nearest neighbor in the random subset? We obtain hardness results and approximation algorithms for stochastic problems of this kind.
A standard way to approximate the distance between any two vertices p and q on a mesh is to compute, in the associated graph, a shortest path from p to q that goes through one of k sources, which are well-chosen vertices. Precomputing the distance between each of the k sources to all vertices of the graph yields an efficient computation of approximate distances between any two vertices. One standard method for choosing k sources, which has been used extensively and successfully for isometryinvariant surface processing, is the so-called Farthest Point Sampling (FPS), which starts with a random vertex as the first source, and iteratively selects the farthest vertex from the already selected sources.In this paper, we analyze the stretch factor F F P S of approximate geodesics computed using FPS, which is the maximum, over all pairs of distinct vertices, of their approximated distance over their geodesic distance in the graph. We show that F F P S can be bounded in terms of the minimal value F * of the stretch factor obtained using an optimal placement of k sources as F F P S 2r 2 e F * + 2r 2 e + 8r e + 1, where r e is the ratio of the lengths of the longest and the shortest edges of the graph. This provides some evidence explaining why farthest point sampling has been used successfully for isometry-invariant shape processing. Furthermore, we show that it is NP-complete to find k sources that minimize the stretch factor.
Abstract. Given a (master) set M of n points in d-dimensional Euclidean space, consider drawing a random subset that includes each point mi ∈ M with an independent probability pi. How difficult is it to compute elementary statistics about the closest pair of points in such a subset? For instance, what is the probability that the distance between the closest pair of points in the random subset is no more than , for a given value ? Or, can we preprocess the master set M such that given a query point q, we can efficiently estimate the expected distance from q to its nearest neighbor in the random subset? We obtain hardness results and approximation algorithms for stochastic problems of this kind.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.