Improved index compression techniques for versioned document collections

He, Jinru; Zeng, Junyuan; Suel, Torsten

doi:10.1145/1871437.1871594

Cited by 23 publications

(49 citation statements)

References 32 publications

“…Our experiments also show that other classical encodings, such as Simple9 [1] and PforDelta [26], perform surprisingly well on repetitive collections, yet they still require 5 times more space than ours. Our techniques still do not match the performance of He et al's methods [12] when their assumptions hold, but these methods are not universal.…”

Section: Introduction (mentioning)

confidence: 66%

“…Given a parameter B, it samples the universe of size u at intervals 2^⌈log_2(uB/ℓ)⌉. In the particular case of highly repetitive collections, the best figures so far have been presented by He et al [12] in the non-positional case. They model versioned document collections using so-called two-level indexes.…”

Section: Data Structures For Inverted Lists (mentioning)

confidence: 96%
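
The sampling rule in the statement above can be illustrated with a minimal Python sketch. It assumes that ℓ is the length of the inverted list and that each sample stores the position of the first posting in its bucket; the function names and the clamping of the exponent to zero are illustrative choices, not details taken from the cited papers.

```python
from bisect import bisect_left

def sampling_interval(u, ell, B):
    """Return 2^ceil(log2(u*B/ell)) using integer arithmetic.

    Assumption: ell is the list length; the exponent is clamped to 0
    when u*B/ell < 1 (our choice, not stated in the source).
    """
    k = 0
    while (ell << k) < u * B:
        k += 1
    return 1 << k

def build_samples(postings, u, step):
    """For each bucket [j*step, (j+1)*step), store the index of the
    first posting that is >= j*step."""
    return [bisect_left(postings, start) for start in range(0, u, step)]

def contains(postings, samples, step, x):
    """Membership test for 0 <= x < u: jump to the bucket of x, then
    binary-search only inside that bucket."""
    b = x // step
    lo = samples[b]
    hi = samples[b + 1] if b + 1 < len(samples) else len(postings)
    i = bisect_left(postings, x, lo, hi)
    return i < hi and postings[i] == x

postings = [3, 8, 9, 20, 41, 42, 60]   # sorted document identifiers
step = sampling_interval(u=64, ell=len(postings), B=2)
samples = build_samples(postings, 64, step)
```

Because the interval is a power of two, the bucket of a value can be found with a shift instead of a division, which is presumably why the rule rounds the exponent up.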

“…Both inverted indexes for word and phrase queries over natural language texts [2,5,11,12], and other indexes for general string collections [16,6,14,7], have been pursued.…”

Section: Introduction (mentioning)

confidence: 99%

“…He et al [11,12] have presented alternative compression methods specifically targeted at highly repetitive collections. Their approach merges all versions of each document for creating the inverted lists and then keeps a secondary index that allows one to list the versions of a document that contain a given term.…”

Section: Introduction (mentioning)

confidence: 99%
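
The two-level idea described in the statement above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the collection layout (a dict from document id to a list of versions, each a list of terms), the function names, and the use of one boolean per version in the second level are all hypothetical, and no compression is applied.

```python
from collections import defaultdict

def build_two_level(collection):
    """collection: {doc_id: [version_0_terms, version_1_terms, ...]}.

    First level: term -> sorted doc ids, built from the union of all
    versions of each document.
    Second level: (term, doc_id) -> one flag per version saying whether
    that version contains the term (an assumed representation).
    """
    first = defaultdict(list)
    second = {}
    for doc_id in sorted(collection):
        version_sets = [set(v) for v in collection[doc_id]]
        merged = set().union(*version_sets)
        for term in merged:
            first[term].append(doc_id)
            second[(term, doc_id)] = [term in vs for vs in version_sets]
    return dict(first), second

def versions_containing(first, second, term):
    """List (doc_id, version_index) pairs whose version contains term."""
    hits = []
    for doc_id in first.get(term, []):
        flags = second[(term, doc_id)]
        hits.extend((doc_id, i) for i, present in enumerate(flags) if present)
    return hits

collection = {
    0: [["a", "b"], ["a", "c"]],
    1: [["b"], ["b", "d"], ["d"]],
}
first, second = build_two_level(collection)
```

The sketch also makes the limitation noted in the citing text visible: the builder needs to be told which versions belong to which document.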

“…Some of those are universal in the sense that they do not need to identify which document is a version of which. The techniques of He et al [11,12] work under a model where there exists a set of independent documents, each of which has a number of versions, and this versioning information must be available to the index. Our techniques can also work on cases where the versions form a tree structure (as in collaborative document creation, software repositories, or phylogenetic trees), or where it is unknown or unclear which documents are versions of which (as in DNA sequence databases).…”

Section: Introduction (mentioning)

confidence: 99%

Indexes for highly repetitive document collections

Claude

Fariña

Martínez‐Prieto

et al. 2011

Proceedings of the 20th ACM International Conference on Information and Knowledge Management

We introduce new compressed inverted indexes for highly repetitive document collections. They are based on runlength, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection. We also introduce compressed self-indexes in the comparison. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower.
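
To make the run-length idea in this abstract concrete, here is a minimal Python sketch, assuming versions of a document receive consecutive identifiers so that a term kept across versions produces runs of gap 1. The (gap, run length) pair format and the function names are illustrative assumptions, not the encoding used by the authors.

```python
from itertools import accumulate, groupby

def to_gaps(postings):
    """Differential list: the first id as is, then successive differences."""
    return [p - q for p, q in zip(postings, [0] + postings[:-1])]

def rle_encode(gaps):
    """Collapse each maximal run of equal gaps into (gap, run_length)."""
    return [(g, len(list(run))) for g, run in groupby(gaps)]

def rle_decode(runs):
    """Expand the runs back into gaps and prefix-sum them into ids."""
    gaps = [g for g, n in runs for _ in range(n)]
    return list(accumulate(gaps))

postings = [5, 6, 7, 8, 20, 21, 22]    # a term present in consecutive versions
runs = rle_encode(to_gaps(postings))   # [(5, 1), (1, 3), (12, 1), (1, 2)]
```

Seven postings shrink to four pairs here; the saving grows with the number of consecutive versions that keep the term, and nothing in the encoder needs to know which identifiers belong to the same document.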

Self-indexing Based on LZ77

Kreft

Navarro

2011

Lecture Notes in Computer Science

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.6 times), extracts 1-2 million characters of the text per second, and finds patterns at a rate of 10-50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.

Document retrieval on repetitive string collections

et al. 2017

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
