Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval 2012
DOI: 10.1145/2348283.2348319

Optimizing positional index structures for versioned document collections

Abstract: Versioned document collections are collections that contain multiple versions of each document. Important examples are Web archives, Wikipedia and other wikis, or source code and documents maintained in revision control systems. Versioned document collections can become very large, due to the need to retain past versions, but there is also a lot of redundancy between versions that can be exploited. Thus, versioned document collections are usually stored using special differential (delta) compression techniques…

Cited by 10 publications (8 citation statements)
References: 35 publications
“…He and Suel [34] also designed a positional inverted index for the repetitive scenario. They apply a previous technique to partition documents into fragments [67] and then use their non-positional approach [36] on the fragments.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our universal techniques, instead, also work on settings where the versions form a tree structure (as in collaborative document creation, software repositories, or phylogenetic trees), or where the versions form a continuous stream of incremental changes (as in periodic publications of technical data), or where it is unknown or unclear which documents are versions of which (as in DNA sequence databases, or near-duplicate pages in Web crawls). He and Suel [34] also designed a positional inverted index for the repetitive scenario. They apply a previous technique to partition documents into fragments [67] and then use their non-positional approach [36] on the fragments.…”
Section: Related Work (mentioning)
confidence: 99%
“…While the algorithms proposed by Canzar et al. [3] and Ferdous et al. [4] employ a set of candidate building blocks consisting of every possible substring in S, DISFY works on a reduced candidate set filtered by their frequencies in S. The frequency-based partitioning or discovering of fragments has been successfully applied previously to index and query in versioned document collections [8]. In this line, we may also highlight the similarity between the constructive procedure to find the basic building blocks in S and the process of discovering association rules, which is one of the most common data mining techniques (Tsay and Chiang [14], Zhang and Zhang [15]).…”
Section: Stage 1: Discover Building Blocks (mentioning)
confidence: 99%
“…In the case of natural language, there exist various proposals to reduce the inverted index size by exploiting the text repetitiveness (Anick and Flynn, 1992; Broder et al., 2006; He et al., 2009, 2010; He and Suel, 2012; Claude et al., 2016). For general string collections, the situation is much worse.…”
Section: Introduction (mentioning)
confidence: 99%