Proceedings of the 18th ACM Conference on Information and Knowledge Management 2009
DOI: 10.1145/1645953.1646008

Compact full-text indexing of versioned document collections

Abstract: We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important examples of such collections are Wikipedia or the web page archive maintained by the Internet Archive. A straightforward indexing approach would simply treat each document version as a separate document, such that index size scales linearly with the number of versions. However, several authors have recently studied appro…

Cited by 29 publications (47 citation statements)
References 33 publications
“…Indexing versioned document collections has been studied in [7,25,14,13]. Broder et al [7] propose a technique that exploits large content overlaps between documents to achieve a reduction in index size.…”
Section: Indexing Versioned Document Collections (mentioning)
confidence: 99%
“…[25] uses content-dependent partitioning technique [21] to partition a page into smaller fragments such that more fragments are common between versions. More recent approaches by Hersovici et al [14] and He et al [13] exploit arbitrary content overlaps between documents to reduce index size. [14] attempt to find subsets of terms that are contained in consecutive versions of a document.…”
Section: Indexing Versioned Document Collections (mentioning)
confidence: 99%
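
The content-dependent partitioning mentioned in the statement above can be illustrated with a minimal sketch. It assumes a token-level sliding window hashed with CRC32; the function name `fragments` and the parameters `window` and `modulus` are hypothetical choices for illustration, not taken from the cited work. Because each boundary depends only on the last few tokens, an edit near the start of a page leaves later fragments unchanged, so they can be shared between versions.

```python
import zlib

def fragments(tokens, window=2, modulus=4):
    """Split a token list into fragments (illustrative sketch only).

    A boundary is placed after position i when the CRC32 of the last
    `window` tokens is divisible by `modulus`. Boundaries therefore depend
    on local content, not on absolute position in the page.
    """
    out, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if i + 1 >= window:
            recent = " ".join(tokens[i + 1 - window:i + 1])
            if zlib.crc32(recent.encode("utf-8")) % modulus == 0:
                out.append(tuple(current))
                current = []
    if current:
        out.append(tuple(current))  # trailing remainder
    return out

if __name__ == "__main__":
    v1 = "the quick brown fox jumps over the lazy dog again and again".split()
    v2 = ["new"] + v1  # a second version with one token inserted at the front
    f1, f2 = fragments(v1), fragments(v2)
    shared = set(f1) & set(f2)
    print(len(f1), len(f2), len(shared))
```

With fixed-size fragments, the single inserted token would shift every later fragment; here only the first fragment can differ, which is the property the partitioning is meant to provide.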
“…Both inverted indexes for word and phrase queries over natural language texts [2,5,11,12], and other indexes for general string collections [16,6,14,7], have been pursued.…”
Section: Introduction (mentioning)
confidence: 99%
“…He et al [11,12] have presented alternative compression methods specifically targeted at highly repetitive collections. Their approach merges all versions of each document for creating the inverted lists and then keeps a secondary index that allows one to list the versions of a document that contain a given term.…”
Section: Introduction (mentioning)
confidence: 99%
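
The two-level organization described in the statement above (merge all versions of a document for the inverted lists, then keep a secondary index listing the versions that contain a term) can be sketched as follows. This is an uncompressed illustration built only from that one-sentence description; the input layout and the names `build_two_level_index` and `lookup` are assumptions, and the sketch is not the authors' implementation.

```python
from collections import defaultdict

def build_two_level_index(collection):
    """collection: {doc_id: [version_text, ...]} (hypothetical input layout)."""
    doc_lists = defaultdict(set)       # first level: term -> documents, versions merged
    version_lists = defaultdict(list)  # second level: (term, doc_id) -> version numbers
    for doc_id, versions in collection.items():
        for v, text in enumerate(versions):
            for term in set(text.lower().split()):
                doc_lists[term].add(doc_id)
                version_lists[(term, doc_id)].append(v)
    return doc_lists, version_lists

def lookup(term, doc_lists, version_lists):
    """Return {doc_id: [versions containing term]} for a single term."""
    return {d: version_lists[(term, d)] for d in sorted(doc_lists.get(term, ()))}

if __name__ == "__main__":
    collection = {
        "A": ["compact index", "compact full-text index", "full-text index"],
        "B": ["web archive"],
    }
    docs, vers = build_two_level_index(collection)
    print(lookup("compact", docs, vers))  # {'A': [0, 1]}
    print(lookup("index", docs, vers))    # {'A': [0, 1, 2]}
```

The first-level list holds one entry per document regardless of how many versions contain the term, which is why it stays small when versions overlap heavily; the per-version detail is pushed into the secondary structure.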