Time–space trade-offs for Lempel–Ziv compressed indexing

Indexing highly repetitive texts -such as genomic databases, software repositories and versioned text collections -has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitivenessthe number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating (only) the text, the size of the smallest automaton recognizing the text factors -have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log w (σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/ log σ), we support count and locate in O( m log(σ)/w ) and O( m log(σ)/w + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length in almost-optimal time O(log(n/r) + log(σ)/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1-2 orders of magnitude.

show abstract

“…4. Self-indexes with efficient extraction require Ω(z log(n/z)) space [105,21,43,10,15], Ω(g) space [17,14], or Ω(e) space [111,7]. 5.…”

Section: Indexmentioning

confidence: 99%

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Gagie

Navarro

Prezza

2020

J. ACM

128

142

View full text Add to dashboard Cite

show abstract

“…On the other hand, there exists a "universal" set Γ P,i of 4(k − 1) positions within the occurrence P i in T C that covers all substrings of P i of length ≤ k. 8 In particular, Γ P,i covers the strings x i for j ∈ {2, . .…”

Section: Computational Complexitymentioning

confidence: 99%

“…Extracting text from Lempel-Ziv compressed text is a notoriously difficult problem. No efficient solution is known within O(z) space (they all require time proportional to the parse's height), although efficient queries can be supported by raising the space by a logarithmic factor [8,6]. Grammars, on the other hand, allow for more compact and timeefficient extraction strategies.…”

mentioning

confidence: 99%

At the roots of dictionary compression: string attractors

Kempa

Prezza

2018

Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing

103

133

View full text Add to dashboard Cite

A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this fact, decades of research have generated myriads of socalled dictionary compressors: algorithms able to reduce the text's size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all distinct text's substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and allows us to uncover new asymptotic relations between the output sizes of different dictionary compressors. We then show that the k-attractor problem -deciding whether a text has a size-t set of positions capturing all substrings of length at most kis NP-complete for k ≥ 3. This, in particular, includes the full string attractor problem. We provide several approximation techniques for the smallest k-attractor, show that the problem is APX-complete for constant k, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it matches the lower bound also on LZ77, straightline programs, collage systems, and macro schemes, and therefore essentially closes (at once) the random access problem for all these compressors.the name of straight-line programs (SLP) [26]; an SLP is a set of rules of the kind X → AB or X → a, where X, A, and B are nonterminals and a is a terminal. The string is obtained from the expansion of a single starting nonterminal S. If also rules of the form X → A ℓ are allowed, for any ℓ > 2, then the grammar is called run-length SLP (RLSLP) [36]. The problems of finding the smallest SLP -of size g * -and the smallest run-length SLP -of size g * rl -are NP-hard [12,23], but fast and effective approximation algorithms are known, e.g., LZ78 [46], LZW [44], Re-Pair [31], Bisection [27]. An even more powerful generalization of RLSLPs is represented by collage systems [25]: in this case, also rules of the form X → Y [l..r] are allowed (i.e., X expands to a substring of Y ). We denote with c the size of a generic collage system, and with c * the size of the smallest one. A related strategy, more powerful than grammar compression, is that of replacing repetitions with pointers to other locations in the string. The most powerful...

show abstract

“…see e.g. [10,3,9,8,17,16,13]. This problem is highly relevant as the amount of highlyrepetitive data increases rapidly, and thus it is possible to handle greater amounts of data by compressing it.…”

Section: Introductionmentioning

confidence: 99%

“…Bille et al [3] O(z(lg(n/z) + lg ǫ z)) O(m + occ(lg ǫ n + lg lg n)) n O(1) Bille et al [3] O(z lg(n/z) lg lg z) O(m + occ lg lg n) O(1) Bille et al [3] O(z lg(n/z)) O(m(1 + lg ǫ z lg(n/z) ) + occ(lg ǫ n + lg lg n)) O(1)…”

Section: Introductionunclassified

Compressed Indexing with Signature Grammars

Christiansen

Ettienne

2018

Lecture Notes in Computer Science

Self Cite

View full text Add to dashboard Cite

The compressed indexing problem is to preprocess a string S of length n into a compressed representation that supports pattern matching queries. That is, given a string P of length m report all occurrences of P in S. We present a data structure that supports pattern matching queries in O(m + occ(lg lg n + lg ǫ z)) time using O(z lg(n/z)) space where z is the size of the LZ77 parse of S and ǫ > 0 is an arbitrarily small constant, when the alphabet is small or z = O(n 1−δ ) for any constant δ > 0. We also present two data structures for the general case; one where the space is increased by O(z lg lg z), and one where the query time changes from worst-case to expected. These results improve the previously best known solutions. Notably, this is the first data structure that decides if P occurs in S in O(m) time using O(z lg(n/z)) space. Our results are mainly obtained by a novel combination of a randomized grammar construction algorithm with well known techniques relating pattern matching to 2D-range reporting.

show abstract

Time–space trade-offs for Lempel–Ziv compressed indexing

Cited by 42 publications

References 37 publications

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

At the roots of dictionary compression: string attractors

Compressed Indexing with Signature Grammars

Contact Info

Product

Resources

About