2015
DOI: 10.1007/978-3-319-28940-3_2

Access Time Tradeoffs in Archive Compression

Abstract: Web archives, query and proxy logs, and so on, can all be very large and highly repetitive, and are accessed only sporadically and partially rather than continually and holistically. This type of data is ideal for compression-based archiving, provided that random access to small fragments of the original data can be achieved without needing to decompress everything. The recent RLZ (relative Lempel-Ziv) compression approach uses a semi-static model extracted from the text to be compressed, together w…
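To make the access pattern the abstract describes concrete, here is a minimal Python sketch, assuming a hypothetical layout in which an archive is a shared dictionary plus a list of per-block factor sequences. All names (decode_block, fragment, BLOCK_SIZE) are illustrative, not the paper's API; the point is that fetching a fragment decodes only the blocks that overlap it.

BLOCK_SIZE = 16 * 1024  # assumed block size; block sizing is discussed in the excerpts below

def decode_block(factors, dictionary: bytes) -> bytes:
    # A factor is a triple (offset, length, literal): a dictionary copy has
    # literal None; a literal factor has length 0 and carries one byte (an int).
    out = bytearray()
    for off, ln, lit in factors:
        if ln > 0:
            out += dictionary[off:off + ln]
        else:
            out.append(lit)
    return bytes(out)

def fragment(blocks, dictionary: bytes, pos: int, length: int) -> bytes:
    # Random access: only the blocks overlapping [pos, pos + length) are
    # decoded; everything else stays compressed.
    first, last = pos // BLOCK_SIZE, (pos + length - 1) // BLOCK_SIZE
    data = b"".join(decode_block(blocks[b], dictionary)
                    for b in range(first, last + 1))
    return data[pos - first * BLOCK_SIZE:][:length]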

Cited by 5 publications (10 citation statements)
References 8 publications
“…The effect of this alphabet partitioning is better compression for the EF-coded values, which on this file are the dominant type, confirming that this option is … Case Study, Text Factorization: The Relative Lempel-Ziv (RLZ) compression mechanism represents a string STR as a sequence of factors from a dictionary D; see Petri et al. [22] for a description and experimental results. To greedily determine longest factors using a CSA, we take T = D^r, the reverse of D, and build a compressed index.…”
Section: Methods
Confidence: 99%
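The excerpt builds a compressed index (CSA) over D^r, the reverse of the dictionary, to answer greedy longest-factor queries. As a simplified stand-in, the sketch below uses a plain, uncompressed suffix array over D itself plus binary search; it illustrates the query, not the CSA machinery, and every name in it is illustrative. (bisect's key= parameter needs Python 3.10+.)

import bisect

def build_suffix_array(d: bytes) -> list[int]:
    # Naive O(n^2 log n) construction; adequate for illustration only.
    return sorted(range(len(d)), key=lambda i: d[i:])

def common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_factor(text: bytes, pos: int, d: bytes, sa: list[int]) -> tuple[int, int]:
    # Longest match of text[pos:] among D's suffixes, returned as (offset, length).
    query = text[pos:]
    i = bisect.bisect_left(sa, query, key=lambda j: d[j:])
    best = (0, 0)
    # In sorted suffix order, the best match is adjacent to the query's
    # insertion point, so only the two neighbouring suffixes need inspecting.
    for j in (i - 1, i):
        if 0 <= j < len(sa):
            k = common_prefix_len(query, d[sa[j]:])
            if k > best[1]:
                best = (sa[j], k)
    return best

For example, with d = b"yabbadabbado" and sa = build_suffix_array(d), longest_factor(b"abbad", 0, d, sa) returns (1, 5).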
“…So that document retrieval and access is fast, the sequence C is divided into fixed-length blocks (except for the last block), each of which is independently factored (in a left-to-right greedy manner) against the dictionary D. Factorization converts each block into a sequence of offset, length pairs, unless the next symbol has not previously been seen, in which case a length-zero literal "factor" is generated [5]. Petri et al. [12] note that it is beneficial to impose a lower limit on the length of each factor, and following their recommendation, we employ a four-byte threshold. That is, if the greedy match from the current position i in C is three or fewer bytes long, then a literal that covers that factor is generated.…”
Section: Background and Related Work
Confidence: 99%
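A minimal sketch of this block-wise scheme, assuming the same (offset, length, literal) factor triples as above and a naive substring search in place of an index; MIN_FACTOR_LEN encodes the four-byte threshold, and too-short greedy matches are covered by length-zero literal factors.

MIN_FACTOR_LEN = 4        # the four-byte lower limit recommended by Petri et al. [12]
BLOCK_SIZE = 16 * 1024    # one choice from the 16-64 KiB range discussed below

def longest_match(text: bytes, pos: int, d: bytes) -> tuple[int, int]:
    # Naive greedy search: grow the match while D still contains it.
    off, ln = 0, 0
    while pos + ln < len(text):
        nxt = d.find(text[pos:pos + ln + 1])
        if nxt < 0:
            break
        off, ln = nxt, ln + 1
    return off, ln

def factor_block(block: bytes, d: bytes) -> list[tuple[int, int, int | None]]:
    factors, i = [], 0
    while i < len(block):
        off, ln = longest_match(block, i, d)
        if ln < MIN_FACTOR_LEN:
            # Match too short to be worth a copy: cover it with literal factors.
            for b in block[i:i + max(1, ln)]:
                factors.append((0, 0, b))
            i += max(1, ln)
        else:
            factors.append((off, ln, None))
            i += ln
    return factors

def factorize(c: bytes, d: bytes) -> list[list[tuple[int, int, int | None]]]:
    # Each block is factored independently, which is what later permits a
    # single block to be decoded in isolation.
    return [factor_block(c[s:s + BLOCK_SIZE], d)
            for s in range(0, len(c), BLOCK_SIZE)]

Independent per-block factorization sacrifices a little compression near block boundaries in exchange for the fast random access the excerpt is after.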
“…Each of the offset, length, and literal streams is encoded separately. Previous experimentation has shown that compressing each with ZLib and using source blocks of size between 16 KiB and 64 KiB provides a good compromise between compression effectiveness and decompression speed [12]. With typical factor lengths of around 20 bytes or more, each of the three compressed integer streams generated for each block corresponds to, at most, approximately 3,000 integers.…”
Section: Background and Related Work
Confidence: 99%
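The stream separation is easy to sketch: per block, the copy offsets, the factor lengths (zero marking a literal), and the literal bytes go to three streams, each compressed with zlib. The fixed 32-bit little-endian packing here is an assumption; the excerpt does not say how the integers are serialized before ZLib.

import struct
import zlib

def encode_block(factors) -> tuple[bytes, bytes, bytes]:
    # factors are the (offset, length, literal) triples from the sketch above.
    lengths  = [ln for _, ln, _ in factors]               # 0 marks a literal factor
    offsets  = [off for off, ln, _ in factors if ln > 0]
    literals = bytes(lit for _, ln, lit in factors if ln == 0)
    pack = lambda ints: struct.pack(f"<{len(ints)}I", *ints)
    return (zlib.compress(pack(offsets)),
            zlib.compress(pack(lengths)),
            zlib.compress(literals))

Under this assumed layout, a decoder walks the lengths stream: each zero pulls one byte from the literals stream, and each non-zero length pulls the next offset and copies from the dictionary.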