Proceedings of the 20th ACM International Conference on Information and Knowledge Management 2011
DOI: 10.1145/2063576.2063646

Indexes for highly repetitive document collections

Abstract: We introduce new compressed inverted indexes for highly repetitive document collections. They are based on runlength, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection. We a…
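
To make the contrast in the abstract concrete, here is a minimal Python sketch of the general idea: a posting list is first turned into its differential (gap) form, and the gaps are then run-length encoded rather than coded one by one. This is an illustration only, not the paper's implementation; the function names (`to_gaps`, `run_length_encode`, `decode`) and the sample posting list are assumptions made up for this example.

```python
from itertools import groupby


def to_gaps(postings):
    """Return the differential form of a strictly increasing list of document ids."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def run_length_encode(gaps):
    """Collapse each maximal run of equal gaps into a (gap, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(gaps)]


def decode(runs):
    """Rebuild the original posting list from (gap, run_length) pairs."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: a term that occurs in many consecutive versions
# of the same document produces long runs of gap 1.
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
gaps = to_gaps(postings)
runs = run_length_encode(gaps)
print(gaps)  # [3, 1, 1, 1, 1, 1, 12, 1, 1, 1]
print(runs)  # [(3, 1), (1, 5), (12, 1), (1, 3)]
```

In a collection where consecutive document ids are near-identical versions, most gaps equal 1, so the ten gaps above shrink to four pairs. Lempel-Ziv and grammar-based compressors, also named in the abstract, would instead exploit repeated gap subsequences; they are not shown here.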

Cited by 16 publications (13 citation statements)
References 28 publications
“…We then describe heuristic optimization algorithms for this problem that can scale to large document collections. Our experiments on large versioned data sets from Wikipedia and the Internet Archive show significant reductions in index size over [32] and [8] with very fast access speeds.…”
Section: Introduction
confidence: 87%
“…Non-positional versioned indexing: A number of compression methods for non-positional versioned indexes have been proposed [4,15,7,5,11,12,8]. The first work on this problem appears to be the work of Anick and Flynn in [4], which proposes a scheme based on the idea of indexing the delta between consecutive document versions and then adjusting query processing suitably.…”
Section: Indexing Versioned Collections
confidence: 99%