Lightweight LCP construction for very large collections of strings

Cox, A.J.; Garofalo, Fabio; Rosone, Giovanna; Sciortino, Marinella

doi:10.1016/j.jda.2016.03.003

Cited by 26 publications

(44 citation statements)

References 25 publications

“…External memory LCP and BWT computation with applications $n = \sum_{h=1}^{k} n_h$. The multi-string BWT [10,25] of $s_1$, . .…”

Section: :4 (mentioning)

confidence: 99%

“…Nevertheless, the simplicity of the algorithm makes it very effective for collections of relatively short sequences, and this has become the reference tool for this problem. This approach was later extended [10] to compute also the LCP values with the same asymptotic number of I/Os. When computing also the LCP values, or when the input strings have different lengths, the algorithm uses O(m) words of RAM, where m is the number of input sequences.…”

Section: Introduction (mentioning)

confidence: 99%

External memory BWT and LCP computation for sequence collections with applications

Egidi, Louza, Manzini et al. 2019. Algorithms Mol Biol

Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM. Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. Conclusions: We prove that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.

“…The longest common prefix (LCP) array of the collection S [30,18,24] is the array lcp(S) of length N + 1, such that lcp(S)[i], with 2 ≤ i ≤ N , is the length of the longest common prefix between the suffixes associated to the positions i and i − 1 in ebwt(S) and lcp(S)[1] = lcp(S)[N + 1] = −1 set by default. We denote by LCP(i, j) the length of the LCP between the suffixes associated with positions i and j in ebwt(S), i.e.…”

Section: Preliminaries (mentioning)

confidence: 99%

The Colored Longest Common Prefix Array Computed via Sequential Scans

Garofalo, Rosone, Sciortino et al. 2018. String Processing and Information Retrieval

Self Cite

Due to the increased availability of large datasets of biological sequences, the tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most of the alignment-free approaches require the computation of statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances. Experimental results confirm the effectiveness of our approach.
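
The LCP array definition quoted in this entry can be made concrete with a small, naive sketch. It is not the sequential-scan construction of the cited papers: it simply sorts all suffixes in RAM and compares neighbours. The function name `lcp_array` is hypothetical, and the sketch assumes that each string gets its own end marker, that the markers are ordered by string index, and that two different markers never match. The list is 0-based, so index `i` here corresponds to position `i + 1` in the quoted definition.

```python
def lcp_array(strings):
    """Naive LCP array of a string collection (illustrative only).

    Assumption: string j ends with its own marker $_j, every marker is
    smaller than any ordinary symbol, and $_i < $_j when i < j.
    """
    suffixes = []
    for j, s in enumerate(strings):
        # Ordinary symbols get key (1, code); the end marker gets key (0, j),
        # so markers sort first and are pairwise distinct.
        keyed = [(1, ord(c)) for c in s] + [(0, j)]
        for start in range(len(keyed)):
            suffixes.append(tuple(keyed[start:]))
    suffixes.sort()

    total = len(suffixes)      # N in the quoted definition
    lcp = [-1] * (total + 1)   # first and last entries stay at -1
    for i in range(1, total):
        prev, cur = suffixes[i - 1], suffixes[i]
        k = 0
        while k < min(len(prev), len(cur)) and prev[k] == cur[k]:
            k += 1
        lcp[i] = k
    return lcp


print(lcp_array(["ab", "ab"]))  # [-1, 0, 0, 2, 0, 1, -1]
```

For the collection {ab, ab} the sorted suffixes are $_0, $_1, ab$_0, ab$_1, b$_0, b$_1, which gives the values shown in the comment. The quadratic cost of this sketch is exactly what the lightweight constructions are designed to avoid.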

“…The Burrows Wheeler transform (BWT), originally introduced as a tool for data compression [4], has found application in the compact representation of many different data structures. After the seminal works [31] showing that the BWT can be used as a compressed full text index for a single string, many researchers have proposed variants of this transformation for string collections [5,24], trees [9,10], graphs [3,27,35], and alignments [30,29]. See [13] for an attempt to provide a unified view of these variants.…”

Section: Introduction (mentioning)

confidence: 99%

“…Historically, the first of such generalizations is the circular BWT [24] considered in Section 6. Here we consider the generalization proposed in [5] which is the one most used in applications. Let $t_0[1, n_0]$ and $t_1[1, n_1]$ be such that $t_0[n_0] = \$_0$ and $t_1[n_1] = \$_1$ where $\$_0 < \$_1$ are two symbols not appearing elsewhere in $t_0$ and $t_1$ and smaller than any other symbol.…”

Section: Introduction (mentioning)

confidence: 99%

Lightweight merging of compressed indices based on BWT variants

Egidi, Manzini 2020. Theoretical Computer Science

In this paper we propose a flexible and lightweight technique for merging compressed indices based on variants of Burrows-Wheeler transform (BWT), thus addressing the need for algorithms that compute compressed indices over large collections using a limited amount of working memory. Merge procedures make it possible to use an incremental strategy for building large indices based on merging indices for progressively larger subcollections. Starting with a known lightweight algorithm for merging BWTs [Holt and McMillan, Bioinformatics 2014], we show how to modify it in order to merge, or compute from scratch, also the Longest Common Prefix (LCP) array. We then expand our technique for merging compressed tries and circular/permuterm compressed indices, two compressed data structures for which there were hitherto no known merging algorithms. ACM Subject Classification: Theory of computation → Design and analysis of algorithms
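
The two-string setup quoted in this entry (each string ending in its own marker, with the markers ordered and smaller than every other symbol) can be illustrated with a naive sketch of a multi-string BWT. This is not the merging algorithm of the paper: it sorts all suffixes in RAM and reads off the preceding symbols. The function name `multi_string_bwt` is hypothetical. One further assumption is labeled in the code: the symbol "preceding" the first position of a string is taken to be that string's own end marker.

```python
def multi_string_bwt(strings):
    """Naive multi-string BWT (illustrative only).

    Assumption: string j ends with marker $_j, markers are smaller than
    any ordinary symbol, and $_i < $_j when i < j.
    """
    rows = []
    for j, s in enumerate(strings):
        # Sort keys: (1, code) for ordinary symbols, (0, j) for the marker.
        keyed = [(1, ord(c)) for c in s] + [(0, j)]
        symbols = list(s) + ["$%d" % j]
        for start in range(len(keyed)):
            # Assumption: for start == 0 the preceding symbol wraps around
            # to the string's own end marker (symbols[-1]).
            rows.append((tuple(keyed[start:]), symbols[start - 1]))
    rows.sort(key=lambda row: row[0])
    return [preceding for _, preceding in rows]


print(multi_string_bwt(["ab", "ab"]))  # ['b', 'b', '$0', '$1', 'a', 'a']
```

Because the end markers are distinct, no two suffixes compare equal, so the sorted order (and hence the output) is uniquely determined even when the input strings are identical.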
