Pegah Kamousi scite author profile

Han

et al. 2013

Proc. VLDB Endow.

Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low cost can be assembled to reconstruct whole genomes. Unfortunately, the large memory footprint of existing de novo assembly algorithms makes it challenging to complete the assembly for higher eukaryotes such as mammals. In this work, we investigate the memory issue of constructing the de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundred gigabytes of memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes of memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually, and later merged with the others to form a de Bruijn graph. By leveraging the overlaps among the k-mers (substrings of length k), MSP achieves a high compression ratio: the total size of the partitions is reduced from Θ(kn) to Θ(n), where n is the size of the short read database and k is the length of a k-mer. Experimental results show that our method can build de Bruijn graphs for large-volume sequence datasets on a commodity computer.
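The partitioning idea in this abstract can be illustrated with a minimal sketch. It assumes that each k-mer is keyed by its lexicographically smallest length-p substring and that consecutive k-mers sharing a key are stored as one longer segment, which is where the saving from Θ(kn) toward Θ(n) would come from. The function names and the parameter p are illustrative and not taken from the paper, and reverse complements and disk I/O are ignored.

```python
from collections import defaultdict


def min_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def partition_reads(reads, k, p):
    """Route runs of consecutive k-mers that share a minimum p-substring
    to the partition keyed by that substring (hypothetical in-memory version)."""
    partitions = defaultdict(list)
    for read in reads:
        if len(read) < k:
            continue  # read too short to contain any k-mer
        start = 0
        current = min_substring(read[0:k], p)
        for i in range(1, len(read) - k + 1):
            key = min_substring(read[i:i + k], p)
            if key != current:
                # k-mers start..i-1 share `current`; together they span read[start:i-1+k]
                partitions[current].append(read[start:i - 1 + k])
                start, current = i, key
        # Flush the final run, which extends to the end of the read.
        partitions[current].append(read[start:])
    return partitions


def kmers_of(partitions, k):
    """Expand stored segments back into the k-mers they contain."""
    out = []
    for segments in partitions.values():
        for s in segments:
            out.extend(s[i:i + k] for i in range(len(s) - k + 1))
    return out
```

In this sketch each partition could then be loaded and processed independently, since every k-mer lands in exactly one partition.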

Stochastic minimum spanning trees in Euclidean spaces

Chan

Suri

2011

Closest Pair and the Post Office Problem for Stochastic Points

Chan

Suri

2011

Abstract. Given a (master) set M of n points in d-dimensional Euclidean space, consider drawing a random subset that includes each point m_i ∈ M independently with probability p_i. How difficult is it to compute elementary statistics about the closest pair of points in such a subset? For instance, what is the probability that the distance between the closest pair of points in the random subset is at most a given threshold value? Or, can we preprocess the master set M such that, given a query point q, we can efficiently estimate the expected distance from q to its nearest neighbor in the random subset? We obtain hardness results and approximation algorithms for stochastic problems of this kind.
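To make the closest-pair question concrete, the following minimal sketch computes the probability in two naive ways: exact enumeration of all 2^n subsets (feasible only for tiny n) and Monte Carlo sampling. Neither is the algorithm from the paper; the function names, the `trials` count, and the fixed `seed` are assumptions chosen for illustration.

```python
import itertools
import math
import random


def closest_pair_distance(points):
    """Smallest pairwise Euclidean distance, or infinity with fewer than two points."""
    if len(points) < 2:
        return math.inf
    return min(math.dist(a, b) for a, b in itertools.combinations(points, 2))


def exact_probability(points, probs, threshold):
    """Sum the probability of every subset whose closest pair is within threshold.
    Exponential in len(points); intended only as a reference for small inputs."""
    total = 0.0
    for mask in itertools.product([False, True], repeat=len(points)):
        weight = 1.0
        for keep, p in zip(mask, probs):
            weight *= p if keep else 1.0 - p
        chosen = [pt for pt, keep in zip(points, mask) if keep]
        if closest_pair_distance(chosen) <= threshold:
            total += weight
    return total


def estimate_probability(points, probs, threshold, trials=10000, seed=0):
    """Monte Carlo estimate: sample random subsets and count the hits."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        chosen = [pt for pt, p in zip(points, probs) if rng.random() < p]
        if closest_pair_distance(chosen) <= threshold:
            hits += 1
    return hits / trials
```

For three collinear points at x = 0, 1, 10, each present with probability 0.5, and a threshold of 1, only the first two points can form a close pair, so the exact answer is 0.5 × 0.5 = 0.25.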

Analysis of farthest point sampling for approximating geodesics in a graph

Lazard

Maheshwari

et al. 2016

Computational Geometry

A standard way to approximate the distance between any two vertices p and q on a mesh is to compute, in the associated graph, a shortest path from p to q that goes through one of k sources, which are well-chosen vertices. Precomputing the distance from each of the k sources to all vertices of the graph yields an efficient computation of approximate distances between any two vertices. One standard method for choosing k sources, which has been used extensively and successfully for isometry-invariant surface processing, is the so-called Farthest Point Sampling (FPS), which starts with a random vertex as the first source and iteratively selects the farthest vertex from the already selected sources. In this paper, we analyze the stretch factor $F_{FPS}$ of approximate geodesics computed using FPS, which is the maximum, over all pairs of distinct vertices, of their approximated distance over their geodesic distance in the graph. We show that $F_{FPS}$ can be bounded in terms of the minimal value $F^*$ of the stretch factor obtained using an optimal placement of k sources as $F_{FPS} \le 2 r_e^2 F^* + 2 r_e^2 + 8 r_e + 1$, where $r_e$ is the ratio of the lengths of the longest and the shortest edges of the graph. This provides some evidence explaining why farthest point sampling has been used successfully for isometry-invariant shape processing. Furthermore, we show that it is NP-complete to find k sources that minimize the stretch factor.
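The procedure described in this abstract can be sketched as follows, assuming a connected undirected graph stored as a dictionary of adjacency dictionaries with positive edge lengths. The function names are illustrative, and the first source is passed in explicitly (rather than drawn at random) so that the result is reproducible.

```python
import heapq
import math


def dijkstra(graph, source):
    """Shortest-path distances from source to every vertex."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def farthest_point_sampling(graph, k, first):
    """Pick k sources: start at `first`, then repeatedly take the vertex
    farthest from the sources chosen so far."""
    sources = [first]
    tables = {first: dijkstra(graph, first)}
    nearest = dict(tables[first])  # distance from each vertex to its closest source
    while len(sources) < k:
        nxt = max(nearest, key=nearest.get)
        sources.append(nxt)
        tables[nxt] = dijkstra(graph, nxt)
        for v in graph:
            nearest[v] = min(nearest[v], tables[nxt][v])
    return sources, tables


def approx_distance(tables, p, q):
    """Length of the shortest p-to-q route forced through one of the sources."""
    return min(t[p] + t[q] for t in tables.values())


def stretch_factor(graph, tables):
    """Maximum ratio of approximate to exact distance over distinct vertex pairs."""
    worst = 1.0
    for p in graph:
        exact = dijkstra(graph, p)
        for q in graph:
            if p != q:
                worst = max(worst, approx_distance(tables, p, q) / exact[q])
    return worst
```

On a path of five vertices with unit edges, starting from one end with k = 2, the second source is the opposite end, and adjacent interior vertices are approximated by a detour of length 3, giving a stretch factor of 3.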

Closest pair and the post office problem for stochastic points

Chan

Suri

2014

Computational Geometry