2011
DOI: 10.1007/978-3-642-21458-5_6

Self-indexing Based on LZ77

Abstract: We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as li…

Cited by 60 publications (60 citation statements)
References: 32 publications
“…Rather, we need repetition aware compression methods. Although this kind of compression is well-known (e.g., grammar-based and Ziv-Lempel-based compression), only recently there have appeared CSAs and other indexes that take advantage of repetitiveness [24][25][26][27]. Yet, those indexes do not support the full suffix tree functionality.…”
Section: Introduction (mentioning)
confidence: 99%
“…We also use the compressed representation of PLCP [8]. Since in our case r ≪ n, we use a compressed bitmap representation that is useful for very sparse bitmaps [13]: We δ-encode the runs of 0s between consecutive 1s, and store absolute pointers to the representation of every s-th 1. This is very efficient in space and solves select_1 queries in time O(s), which is the operation needed to compute a PLCP value.…”
Section: Our Repetition-aware CST (mentioning)
confidence: 99%
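
The sparse-bitmap scheme quoted above can be illustrated with a minimal Python sketch. It is not the cited authors' implementation: the class name `SparseBitmap`, the helper names, the 0-based ranks, and the use of a Python string as the bit stream are all assumptions made for readability. Each non-sampled 1 is stored as the Elias-δ code of its distance to the previous 1 (the run of 0s plus one, since δ codes need values of at least 1), and every s-th 1 is stored as an absolute position together with a pointer into the stream.

```python
def delta_encode(n):
    """Elias-delta code of an integer n >= 1, as a string of '0'/'1'."""
    nb = n.bit_length()           # number of bits of n
    lb = nb.bit_length()          # number of bits of nb
    # gamma code of nb, followed by n without its leading 1 bit
    return "0" * (lb - 1) + format(nb, "b") + format(n, "b")[1:]


def delta_decode(bits, pos):
    """Decode one Elias-delta code starting at bits[pos]; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    pos += zeros
    nb = int(bits[pos:pos + zeros + 1], 2)
    pos += zeros + 1
    n = int("1" + bits[pos:pos + nb - 1], 2)
    return n, pos + nb - 1


class SparseBitmap:
    """Hypothetical sketch: gap-encoded bitmap with one sample every s ones."""

    def __init__(self, ones, s):
        # ones: strictly increasing 0-based positions of the 1 bits
        self.s = s
        self.samples = []         # (absolute position, offset into self.bits)
        chunks, offset, prev = [], 0, None
        for j, p in enumerate(ones):
            if j % s == 0:
                # sampled 1: keep its position and where the next gap code starts
                self.samples.append((p, offset))
            else:
                code = delta_encode(p - prev)   # gap = run of 0s + 1
                chunks.append(code)
                offset += len(code)
            prev = p
        self.bits = "".join(chunks)

    def select1(self, j):
        """Position of the j-th 1 (0-based), decoding at most s - 1 gaps."""
        k, r = divmod(j, self.s)
        pos, off = self.samples[k]
        for _ in range(r):
            gap, off = delta_decode(self.bits, off)
            pos += gap
        return pos


ones = [3, 4, 10, 1000, 1001, 5000, 123456]
bm = SparseBitmap(ones, s=3)
recovered = [bm.select1(j) for j in range(len(ones))]
```

A larger `s` stores fewer absolute samples but makes each `select1` call decode more gaps, which is the space/time trade-off behind the O(s) bound in the quoted statement.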
“…We used various DNA collections from the Repetitive Corpus at PizzaChili (http://pizzachili.dcc.uchile.cl/repcorpus, created and thoroughly studied by Kreft [12]). We took DNA collections Para and Influenza, which are the most repetitive ones, and Escherichia, a less repetitive one.…”
Section: Experimental Evaluation (mentioning)
confidence: 99%
“…Repetitiveness is not captured by statistical compression methods nor frequency-based entropy definitions [16,24] (i.e., the frequencies of symbols do not change much if we add near-copies of an initial sequence). Rather, we need repetition aware compression methods.…”
Section: Introduction (mentioning)
confidence: 99%
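
The point made in this statement can be checked on a toy input. The sketch below is an illustration only, with hypothetical function names and an arbitrary sample string: `h0` computes the zero-order empirical entropy from symbol frequencies, and `lz77_phrases` counts the phrases of a naive greedy LZ77 parse (longest earlier match plus one explicit symbol, quadratic time or worse, unrelated to the indexed structure of the paper). Appending a copy of the text leaves the per-symbol entropy unchanged, while the LZ77 parse grows by at most a phrase or two.

```python
from collections import Counter
from math import log2


def h0(text):
    """Zero-order empirical entropy of text, in bits per symbol."""
    n = len(text)
    return -sum(c / n * log2(c / n) for c in Counter(text).values())


def lz77_phrases(text):
    """Number of phrases in a naive greedy LZ77 parse (sources may overlap)."""
    i, z, n = 0, 0, len(text)
    while i < n:
        length = 0
        # extend the match while text[i:i+length+1] also starts before position i
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += length + 1           # matched part plus one explicit symbol
        z += 1
    return z


t = "GATTACAGATTTCAGACCATAGGAC"   # arbitrary toy sequence
entropy_single, entropy_double = h0(t), h0(t + t)
phrases_single, phrases_double = lz77_phrases(t), lz77_phrases(t + t)
```

Here `entropy_single` and `entropy_double` agree because doubling the text doubles every symbol count and the length alike, whereas the second copy is absorbed by the LZ77 parse as a long reference to the first.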