Background
Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM.
Results
We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs.
Conclusions
We prove that our algorithm performs
sequential I/Os, where
n
is the total length of the collection and
is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
The functional fragment of Landin’s ISWIM as implemented by the SECD machine is the paradigm of the procedural kernel of many programming languages. We investigate and compare operational, denotational and logical descriptions of the ISWIM-SECD system. Our goal is to illustrate how to derive from each of these descriptions logical tools for resoning about termination and equivalence of programs. First we show the correctness and incompleteness of the canonical denotational semantics. Then we give a fully abstract quotient semantics using a notion of applicative bisimulation. We discuss next a finitary logical description of the denotational semantics. This takes the form of a call-by-value intersection type assignment system. Finally we study this type assignment system for its own sake and give a completeness result for it with respect to a natural notion of interpretation.
Recently, Holt and McMillan [Bionformatics 2014, ACM-BCB 2014] have proposed a simple and elegant algorithm to merge the Burrows-Wheeler transforms of a collection of strings. In this paper we show that their algorithm can be improved so that, in addition to the BWTs, it also merges the Longest Common Prefix (LCP) arrays. Because of its small memory footprint this new algorithm can be used for the final merge of BWT and LCP arrays computed by a faster but memory intensive construction algorithm.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.