Optimizing positional index structures for versioned document collections

He, Jinru; Suel, Torsten

doi:10.1145/2348283.2348319

Cited by 10 publications

(8 citation statements)

References 35 publications

“…He and Suel [34] also designed a positional inverted index for the repetitive scenario. They apply a previous technique to partition documents into fragments [67] and then use their non-positional approach [36] on the fragments.…”

Section: Related Work (mentioning)

confidence: 99%

“…Our universal techniques, instead, also work on settings where the versions form a tree structure (as in collaborative document creation, software repositories, or phylogenetic trees), or where the versions form a continuous stream of incremental changes (as in periodic publications of technical data), or where it is unknown or unclear which documents are versions of which (as in DNA sequence databases, or near-duplicate pages in Web crawls). He and Suel [34] also designed a positional inverted index for the repetitive scenario. They apply a previous technique to partition documents into fragments [67] and then use their non-positional approach [36] on the fragments.…”

Section: Related Work (mentioning)

confidence: 99%

“…There is a burst of recent activity in exploiting repetitiveness at the indexing structures, in order to provide fast searches in the collection within little space. Both inverted indexes for word and phrase queries over natural language texts [3,12,35,65,36,34], and other indexes for general string collections [43,16,19,20,40,28,29,26,9], have been pursued.…”

Section: Introduction (mentioning)

confidence: 99%

Universal indexes for highly repetitive document collections

Claude, Fariña, Martínez-Prieto, et al. 2016

Information Systems

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet they are orders of magnitude slower.

© 2016 Elsevier Ltd. All rights reserved.

Funding: European Union 690941; Fondecyt (Conicyt, Chile) 1-140796; MINECO (PGE) TIN2013-47090-C3-3-P, TIN2015-69951-R, TIN2013-46238-C4-3-R; MINECO (FEDER) TIN2013-47090-C3-3-P, TIN2015-69951-R, TIN2013-46238-C4-3-R; ICT COST Action IC1302; Xunta de Galicia (FEDER) (Spain) GRC2013/05
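The abstract contrasts plain gap-encoding of inverted lists with run-length compression of the differential (gap) lists. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes that versions of a document receive consecutive document IDs, so a term that survives across versions yields long runs of gap 1. The function names and the sample posting list are hypothetical.

```python
from itertools import groupby


def gap_encode(postings):
    """Convert a strictly increasing list of document IDs into gaps."""
    gaps = []
    previous = 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def run_length_encode(gaps):
    """Collapse each run of equal gaps into a (gap, run_length) pair."""
    return [(gap, len(list(run))) for gap, run in groupby(gaps)]


def run_length_decode(pairs):
    """Rebuild the original document IDs from (gap, run_length) pairs."""
    postings = []
    current = 0
    for gap, count in pairs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: the term occurs in versions 3..8 of one
# document and in versions 20..23 of another (IDs assumed consecutive).
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
gaps = gap_encode(postings)
pairs = run_length_encode(gaps)
```

In this sketch ten gaps collapse into four pairs. The saving grows with the length of the version chains, which is the regularity the abstract says gap-encoding alone does not exploit.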

“…While the algorithms proposed by Canzar et al [3] and Ferdous et al [4] employ a set of candidate building blocks consisting of every possible substring in S, DISFY works on a reduced candidate set filtered by their frequencies in S. The frequency-based partitioning or discovering of fragments has been successfully applied previously to index and query in versioned document collections [8]. In this line, we may also highlight the similarity between the constructive procedure to find the basic building blocks in S and the process of discovering association rules, which is one of the most common data mining techniques Tsay and Chiang [14], Zhang and Zhang [15].…”

Section: Stage 1: Discover Building Blocks (mentioning)

confidence: 99%

A two-stage constructive method for the unweighted minimum string cover problem

Lozano, Rodríguez, García-Martínez 2015

Knowledge-Based Systems

“…In the case of natural language, there exist various proposals to reduce the inverted index size by exploiting the text repetitiveness (Anick and Flynn, 1992; Broder et al, 2006; He et al, 2009, 2010; He and Suel, 2012; Claude et al, 2016). For general string collections, the situation is much worse.…”

Section: Introduction (mentioning)

confidence: 99%