2015
DOI: 10.1007/978-3-319-28940-3_2

Access Time Tradeoffs in Archive Compression

Abstract: Web archives, query and proxy logs, and so on, can all be very large and highly repetitive, and are accessed only sporadically and partially rather than continually and holistically. This type of data is ideal for compression-based archiving, provided that random access to small fragments of the original data can be achieved without needing to decompress everything. The recent RLZ (relative Lempel-Ziv) compression approach uses a semi-static model extracted from the text to be compressed, together w…
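To make the access pattern the abstract describes concrete, here is a minimal Python sketch, assuming a hypothetical layout in which an archive is a shared dictionary plus a list of per-block factor sequences. All names (decode_block, fragment, BLOCK_SIZE) are illustrative, not the paper's API; the point is that fetching a fragment decodes only the blocks that overlap it.

BLOCK_SIZE = 16 * 1024  # assumed block size; block sizing is discussed in the excerpts below

def decode_block(factors, dictionary: bytes) -> bytes:
    # A factor is a triple (offset, length, literal): a dictionary copy has
    # literal None; a literal factor has length 0 and carries one byte (an int).
    out = bytearray()
    for off, ln, lit in factors:
        if ln > 0:
            out += dictionary[off:off + ln]
        else:
            out.append(lit)
    return bytes(out)

def fragment(blocks, dictionary: bytes, pos: int, length: int) -> bytes:
    # Random access: only the blocks overlapping [pos, pos + length) are
    # decoded; everything else stays compressed.
    first, last = pos // BLOCK_SIZE, (pos + length - 1) // BLOCK_SIZE
    data = b"".join(decode_block(blocks[b], dictionary)
                    for b in range(first, last + 1))
    return data[pos - first * BLOCK_SIZE:][:length]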

Cited by 5 publications (10 citation statements)
References 8 publications
“…The effect of this alphabet partitioning is better compression for the EF-coded values, which on this file are the dominant type, confirming that this option is … Case Study, Text Factorization: The Relative Lempel-Ziv (RLZ) compression mechanism represents a string STR as a sequence of factors from a dictionary D; see Petri et al. [22] for a description and experimental results. To greedily determine longest factors using a CSA, we take T = D^r, the reverse of D, and build a compressed index.…”
Section: Methods
Confidence: 99%
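The excerpt builds a compressed index (CSA) over D^r, the reverse of the dictionary, to answer greedy longest-factor queries. As a simplified stand-in, the sketch below uses a plain, uncompressed suffix array over D itself plus binary search; it illustrates the query, not the CSA machinery, and every name in it is illustrative. (bisect's key= parameter needs Python 3.10+.)

import bisect

def build_suffix_array(d: bytes) -> list[int]:
    # Naive O(n^2 log n) construction; adequate for illustration only.
    return sorted(range(len(d)), key=lambda i: d[i:])

def common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_factor(text: bytes, pos: int, d: bytes, sa: list[int]) -> tuple[int, int]:
    # Longest match of text[pos:] among D's suffixes, returned as (offset, length).
    query = text[pos:]
    i = bisect.bisect_left(sa, query, key=lambda j: d[j:])
    best = (0, 0)
    # In sorted suffix order, the best match is adjacent to the query's
    # insertion point, so only the two neighbouring suffixes need inspecting.
    for j in (i - 1, i):
        if 0 <= j < len(sa):
            k = common_prefix_len(query, d[sa[j]:])
            if k > best[1]:
                best = (sa[j], k)
    return best

For example, with d = b"yabbadabbado" and sa = build_suffix_array(d), longest_factor(b"abbad", 0, d, sa) returns (1, 5).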
“…So that document retrieval and access is fast, the sequence C is divided into fixed-length blocks (except for the last block), each of which is independently factored (in a left-to-right greedy manner) against the dictionary D. Factorization converts each block into a sequence of offset, length pairs, unless the next symbol has not previously been seen, in which case a length-zero literal "factor" is generated [5]. Petri et al. [12] note that it is beneficial to impose a lower limit on the length of each factor, and following their recommendation, we employ a four-byte threshold. That is, if the greedy match from the current position i in C is three or fewer bytes long, then a literal that covers that factor is generated.…”
Section: Background and Related Work
Confidence: 99%
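A minimal sketch of this block-wise scheme, assuming the same (offset, length, literal) factor triples as above and a naive substring search in place of an index; MIN_FACTOR_LEN encodes the four-byte threshold, and too-short greedy matches are covered by length-zero literal factors.

MIN_FACTOR_LEN = 4        # the four-byte lower limit recommended by Petri et al. [12]
BLOCK_SIZE = 16 * 1024    # one choice from the 16-64 KiB range discussed below

def longest_match(text: bytes, pos: int, d: bytes) -> tuple[int, int]:
    # Naive greedy search: grow the match while D still contains it.
    off, ln = 0, 0
    while pos + ln < len(text):
        nxt = d.find(text[pos:pos + ln + 1])
        if nxt < 0:
            break
        off, ln = nxt, ln + 1
    return off, ln

def factor_block(block: bytes, d: bytes) -> list[tuple[int, int, int | None]]:
    factors, i = [], 0
    while i < len(block):
        off, ln = longest_match(block, i, d)
        if ln < MIN_FACTOR_LEN:
            # Match too short to be worth a copy: cover it with literal factors.
            for b in block[i:i + max(1, ln)]:
                factors.append((0, 0, b))
            i += max(1, ln)
        else:
            factors.append((off, ln, None))
            i += ln
    return factors

def factorize(c: bytes, d: bytes) -> list[list[tuple[int, int, int | None]]]:
    # Each block is factored independently, which is what later permits a
    # single block to be decoded in isolation.
    return [factor_block(c[s:s + BLOCK_SIZE], d)
            for s in range(0, len(c), BLOCK_SIZE)]

Independent per-block factorization sacrifices a little compression near block boundaries in exchange for the fast random access the excerpt is after.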
“…Each of the offset, length, and literal streams is encoded separately. Previous experimentation has shown that compressing each with ZLib and using source blocks of size between 16 KiB and 64 KiB provides a good compromise between compression effectiveness and decompression speed [12]. With typical factor lengths of around 20 bytes or more, each of the three compressed integer streams generated for each block corresponds to, at most, approximately 3,000 integers.…”
Section: Background and Related Work
Confidence: 99%
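The stream separation is easy to sketch: per block, the copy offsets, the factor lengths (zero marking a literal), and the literal bytes go to three streams, each compressed with zlib. The fixed 32-bit little-endian packing here is an assumption; the excerpt does not say how the integers are serialized before ZLib.

import struct
import zlib

def encode_block(factors) -> tuple[bytes, bytes, bytes]:
    # factors are the (offset, length, literal) triples from the sketch above.
    lengths  = [ln for _, ln, _ in factors]               # 0 marks a literal factor
    offsets  = [off for off, ln, _ in factors if ln > 0]
    literals = bytes(lit for _, ln, lit in factors if ln == 0)
    pack = lambda ints: struct.pack(f"<{len(ints)}I", *ints)
    return (zlib.compress(pack(offsets)),
            zlib.compress(pack(lengths)),
            zlib.compress(literals))

Under this assumed layout, a decoder walks the lengths stream: each zero pulls one byte from the literals stream, and each non-zero length pulls the next offset and copies from the dictionary.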