Compact full-text indexing of versioned document collections

He, Jinru; Yan, Hao; Suel, Torsten

doi:10.1145/1645953.1646008

Cited by 29 publications

(47 citation statements)

References 33 publications

“…Indexing versioned document collections has been studied in [7,25,14,13]. Broder et al [7] propose a technique that exploits large content overlaps between documents to achieve a reduction in index size.…”

Section: Indexing Versioned Document Collections (mentioning)

confidence: 99%

“…[25] uses content-dependent partitioning technique [21] to partition a page into smaller fragments such that more fragments are common between versions. More recent approaches by Hersovici et al [14] and He et al [13] exploit arbitrary content overlaps between documents to reduce index size. [14] attempt to find subsets of terms that are contained in consecutive versions of a document.…”

Section: Indexing Versioned Document Collections (mentioning)

confidence: 99%

“…Each subset is stored into a virtual document and the total storage cost is optimized by minimizing the overall number and size of the virtual documents. [13] propose a two-level index compression that improves the query processing time. This approach groups similar union-documents into clusters, where a union-document contains all terms in the corresponding versions, and the terms are compressed locally for each cluster.…”

Section: Indexing Versioned Document Collections (mentioning)

confidence: 99%

Durable top-k search in document archives

Mamoulis, Berberich, et al. 2010

Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data

We propose and study a new ranking problem in versioned databases. Consider a database of versioned objects which have different valid instances along a history (e.g., documents in a web archive). Durable top-k search finds the set of objects that are consistently in the top-k results of a query (e.g., a keyword query) throughout a given time interval (e.g., from June 2008 to May 2009). Existing work on temporal top-k queries mainly focuses on finding the most representative top-k elements within a time interval. Such methods are not readily applicable to durable top-k queries. To address this need, we propose two techniques that compute the durable top-k result. The first is adapted from the classic top-k rank aggregation algorithm NRA. The second technique is based on a shared execution paradigm and is more efficient than the first approach. In addition, we propose a special indexing technique for archived data. The index, coupled with a space partitioning technique, improves performance even further. We use data from Wikipedia and the Internet Archive to demonstrate the efficiency and effectiveness of our solutions.
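To make the problem statement concrete, here is a naive baseline sketch of durable top-k search. It is not the NRA-based or shared-execution technique from the abstract. It assumes the time interval has been discretized into snapshots, each mapping an object id to its query score, and it takes the strictest reading of "consistently": an object must appear in the top-k of every snapshot. The names `top_k` and `durable_top_k` are hypothetical.

```python
from typing import Dict, List, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the ids of the k highest-scoring objects in one snapshot."""
    # Ties are broken by object id so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {obj for obj, _ in ranked[:k]}


def durable_top_k(snapshots: List[Dict[str, float]], k: int) -> Set[str]:
    """Objects that are in the top-k of every snapshot (assumed definition)."""
    if not snapshots:
        return set()
    result = top_k(snapshots[0], k)
    for scores in snapshots[1:]:
        result &= top_k(scores, k)
        if not result:
            break  # nothing can become durable again
    return result


snapshots = [
    {"a": 3.0, "b": 2.0, "c": 1.0},
    {"a": 1.0, "b": 3.0, "c": 2.0},
    {"a": 2.0, "b": 3.0, "c": 1.0},
]
durable = durable_top_k(snapshots, k=2)
```

This baseline recomputes a full ranking per snapshot, which is the cost the two proposed techniques aim to avoid.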

“…Both inverted indexes for word and phrase queries over natural language texts [2,5,11,12], and other indexes for general string collections [16,6,14,7], have been pursued.…”

Section: Introduction (mentioning)

confidence: 99%

“…He et al [11,12] have presented alternative compression methods specifically targeted at highly repetitive collections. Their approach merges all versions of each document for creating the inverted lists and then keeps a secondary index that allows one to list the versions of a document that contain a given term.…”

Section: Introduction (mentioning)

confidence: 99%
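The two-level layout described in the statement above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the first level is an uncompressed inverted list over merged (union) documents, and the secondary index is a per-(term, document) bitmask with bit v set when version v contains the term. The function names and the bitmask layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Index = Tuple[Dict[str, List[int]], Dict[Tuple[str, int], int]]


def build_two_level_index(collection: Dict[int, List[str]]) -> Index:
    """collection maps a document id to the texts of its versions, in order."""
    first_level = defaultdict(list)  # term -> sorted ids of union-documents
    second_level = {}                # (term, doc id) -> bitmask of versions
    for doc_id in sorted(collection):
        for version, text in enumerate(collection[doc_id]):
            for term in set(text.split()):
                key = (term, doc_id)
                if key not in second_level:
                    # First time this term is seen in any version of the doc.
                    first_level[term].append(doc_id)
                    second_level[key] = 0
                second_level[key] |= 1 << version
    return dict(first_level), second_level


def versions_containing(index: Index, term: str) -> List[Tuple[int, int]]:
    """List (doc id, version) pairs whose text contains the term."""
    first_level, second_level = index
    hits = []
    for doc_id in first_level.get(term, []):
        mask = second_level[(term, doc_id)]
        hits.extend((doc_id, v) for v in range(mask.bit_length()) if (mask >> v) & 1)
    return hits


collection = {1: ["a b", "a c"], 2: ["b", "b c"]}
index = build_two_level_index(collection)
```

A query first intersects the short first-level lists and consults the bitmasks only for documents that survive, which is where the space and time savings in such a layout would come from.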

Indexes for highly repetitive document collections

Claude, Fariña, Martínez‐Prieto, et al. 2011

Proceedings of the 20th ACM International Conference on Information and Knowledge Management

We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection. We also introduce compressed self-indexes in the comparison. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower.
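As a rough illustration of the run-length idea mentioned in the abstract, the sketch below converts a posting list into its differential (gap) form and then collapses repeated gaps into (gap, count) runs. It assumes posting ids are positive, strictly increasing integers; the function names are hypothetical and the paper's actual encoders are not reproduced here.

```python
from typing import List, Tuple


def to_gaps(postings: List[int]) -> List[int]:
    """Differential form: each id is replaced by its distance from the previous id."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def run_length_encode(gaps: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal gaps into (gap, count) pairs."""
    runs: List[List[int]] = []
    for gap in gaps:
        if runs and runs[-1][0] == gap:
            runs[-1][1] += 1
        else:
            runs.append([gap, 1])
    return [(gap, count) for gap, count in runs]


def decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild the original posting list from (gap, count) runs."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


postings = [3, 4, 5, 6, 7, 20, 21, 22]
runs = run_length_encode(to_gaps(postings))
```

When a term survives across many consecutive versions, its gaps are long runs of 1, so the run representation grows with the number of runs rather than the number of postings.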

Partially Decompressing Binary Interpolative Coding for Fast Query Processing

Wang 2016

Web Information Systems Engineering – WISE 2016
