Data compression with long repeated strings

Bentley, Jon Louis; McIlroy, D.

doi:10.1016/s0020-0255(01)00097-4

Cited by 20 publications

(16 citation statements)

References 9 publications


“…Table 5 contains the results of the experiment similar to the experiment described by Bentley and McIlroy in Ref. [11]. It confirms that PPMII compresses much better than gzip.…”

Section: Complex Gain Function (supporting)

confidence: 69%

“…In the paper of Bentley and McIlroy [11], we can find description of a very good algorithm, which has been created to find long repeated strings. This is a preprocessing algorithm, which can interact with many known compression algorithms.…”

Section: Finding Long Repeated Strings (mentioning)

confidence: 99%

“…The following study is based on a work of Bentley and McIlroy [11], which presents an idea of finding long repeated strings in a file. The algorithm, demonstrated here, is not limited to text files, since neither the PPM algorithm nor the algorithm proposed by Bentley and McIlroy are limited to text files.…”

Section: Introduction (mentioning)

confidence: 99%


PPM with the extended alphabet

Skibiński

2006

Information Sciences



“…To encode large repositories, long repeated strings are identified and then delta encoding is applied [26]. As a practical industrial example, Google adopted such a technique [2] for handling of long repeated strings in the collection data of their Bigtable system [6]. RLZ can also be regarded as an application of string substitution, though the substitutions refer to an external dictionary rather than to previous parts of the collection itself.…”

Section: Collection Compression Methods (mentioning)

confidence: 99%

“…Budget for RAM-resident dictionary; F(C, D): Factorization of (collection) C against (dictionary) D: a sequence of factors; R(C, D): The corresponding compression ratio of RLZ for (collection) C and (dictionary) D; C 1: Concatenation of all 'small' factors from C2 that are picked by the CuD algorithm; D 2: Dictionary sampled from C L 2 (for Figure 5); D 1: Dictionary from the initial tranche (baseline in Table 3); D o 2: Dictionary generated from C o 2 , which is large enough and will not be concatenated with D1 (baseline in Table 6) …”

mentioning

confidence: 99%

Compact Auxiliary Dictionaries for Incremental Compression of Large Repositories

Tong

Wirth

Zobel

2014

Proceedings of the 23rd ACM International Conference on Information and Knowledge Management


Compression is widely exploited in retrieval systems, such as search engines and text databases, to lower both retrieval costs and system latency. In particular, compression of repositories can reduce storage requirements and fetch times, while improving caching. One of the most effective techniques is relative Lempel-Ziv, RLZ, in which a RAM-resident dictionary encodes the collection. With RLZ, a specified document can be decoded independently and extremely fast, while maintaining a high compression ratio. For terabyte-scale collections, this dictionary need only be a fraction of a per cent of the original data size. However, as originally described, RLZ uses a static dictionary, against which encoding of new data may be inefficient. An obvious alternative is to generate a new dictionary solely from the new data. However, this approach may not be scalable because the combined RAM-resident dictionary will grow in proportion to the collection. In this paper, we describe effective techniques for extending the original dictionary to manage new data. With these techniques, a new auxiliary dictionary, relatively limited in size, is created by interrogating the original dictionary with the new data. Then, to compress this new data, we combine the auxiliary dictionary with some parts of the original dictionary (the latter in fact encoded as pointers into that original dictionary) to form a second dictionary. Our results show that excellent compression is available with only small auxiliary dictionaries, so that RLZ can feasibly transmit and store large, growing collections.
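As a rough illustration of the RLZ idea summarized in this abstract, the sketch below factorizes a text against a fixed dictionary by greedy longest match and decodes it again. It is a minimal sketch, not the authors' implementation: the names `rlz_factorize` and `rlz_decode` are hypothetical, the substring search is deliberately naive (practical RLZ coders index the dictionary, for example with a suffix array), and emitting a literal character when nothing matches is an assumption made here for simplicity.

```python
from typing import List, Tuple, Union

# A factor is either a (position, length) pointer into the dictionary
# or a single literal character that does not occur in the dictionary.
Factor = Union[Tuple[int, int], str]


def rlz_factorize(text: str, dictionary: str) -> List[Factor]:
    """Greedily split `text` into longest matches against `dictionary`."""
    factors: List[Factor] = []
    i = 0
    while i < len(text):
        best_pos, best_len = -1, 0
        length = 1
        # Extend the candidate match one character at a time (naive search).
        while i + length <= len(text):
            pos = dictionary.find(text[i:i + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            factors.append(text[i])  # no match at all: emit a literal
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_decode(factors: List[Factor], dictionary: str) -> str:
    """Rebuild the text using only the factors and the dictionary."""
    parts = []
    for f in factors:
        if isinstance(f, str):
            parts.append(f)
        else:
            pos, length = f
            parts.append(dictionary[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    dictionary = "abracadabra"
    text = "cadabrax abra"
    factors = rlz_factorize(text, dictionary)
    print(factors)
    print(rlz_decode(factors, dictionary) == text)
```

Because decoding needs only the factors and the dictionary, each document can be decoded independently of the others, which is the property the abstract highlights.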


Lempel‐Ziv compression of highly structured documents

Adiego

Navarro

Fuente

2007

J. Am. Soc. Inf. Sci.


The authors describe Lempel-Ziv to Compress Structure (LZCS), a novel Lempel-Ziv approach suitable for compressing structured documents. LZCS takes advantage of repeated substructures that may appear in the documents, by replacing them with a backward reference to their previous occurrence. The result of the LZCS transformation is still a valid structured document, which is human-readable and can be transmitted by ASCII channels. Moreover, LZCS transformed documents are easy to search, display, access at random, and navigate. In a second stage, the transformed documents can be further compressed using any semistatic technique, so that it is still possible to do all those operations efficiently; or with any adaptive technique to boost compression. LZCS is especially efficient in the compression of collections of highly structured data, such as extensible markup language (XML) forms, invoices, e-commerce, and Web-service exchange documents. The comparison with other structure-aware and standard compressors shows that LZCS is a competitive choice for this type of document, whereas the others are not well-suited to support navigation or random access. When joined to an adaptive compressor, LZCS obtains by far the best compression ratios.
