Document retrieval on repetitive string collections

Gagie, Travis; Hartikainen, Aleksi; Karhu, Kalle; Kärkkäinen, Juha; Navarro, Gonzalo; Puglisi, Simon J.; Sirén, Jouni

doi:10.1007/s10791-017-9297-7

Cited by 14 publications

(43 citation statements)

References 47 publications

“…This is confirmed by the experiments. We showcase our new solution in one specific self-index-based document retrieval framework, but we point out that this component can also be utilized in other variants as presented by Gagie et al. [3].…”

Section: Results

mentioning

confidence: 99%

Elias-Fano meets Single-Term Top-k Document Retrieval

Labeit

Gog

2017

2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX)

A fundamental problem in Information Retrieval is to determine the k most relevant documents of a collection for a given query word or phrase P. In a recent result, Navarro and Nekrich [SODA 2012] showed that this problem can be solved in optimal time complexity of O(|P| + k) with a precomputed linear-space index. The size of this optimal-time index was estimated to be 80 times the collection size, rendering it impractical. In subsequent work, Navarro and Konow [DCC 2013] and Gog and Navarro [ALENEX 2015] created a practical version with slightly worse query time guarantees but reduced the space to 2.5 to 3 times the collection size. The index is conceptually simple and is divided into five components. In this paper we show how the n log N bits required by the usually largest component (the so-called repetition array) can be reduced to n log log n + O(n), where n is the size of the collection and N the number of documents. As the overall query time complexity matches that of the old index, we achieve a theoretically superior time-space trade-off. We explore the practical properties of the improved index in a detailed experimental study and compare it to the previously established baseline. Index sizes are now between 1.5 and 2 times the collection size while query speed is comparable to the larger indexes. We also show that the new approach automatically adapts to highly repetitive text collections, which are for instance produced by version control systems.
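
To make the problem statement in the abstract above concrete, here is a minimal brute-force sketch of single-term top-k retrieval in Python. It is not the index described in the paper: it scans every document at query time, and it assumes that relevance means term frequency (the number of occurrences of P in a document), which this excerpt does not state. The function names `count_occurrences` and `top_k_documents` are hypothetical.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count, start = 0, 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            return count
        count += 1
        start = pos + 1  # advance by one so overlapping matches are counted


def top_k_documents(docs, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Assumption: relevance is term frequency. Ties are broken by
    the smaller document identifier.
    """
    scored = []
    for doc_id, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((freq, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(doc_id, freq) for freq, doc_id in best]
```

A scan like this takes time proportional to the whole collection for every query, which is the cost that the precomputed indexes discussed in the abstract are designed to avoid.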

“…E.g. for d_0 we mark the root as it is the LCA of the suffix pair (5,0), and node v_9 for suffix pair (0,2), again the root for (2,4), and v_13 for (4,1) and (1,3). Connecting all nodes marked with a specific d_i (see the green arrows in Fig.…”

Section: The Basic Framework and Data Structures

mentioning

confidence: 99%

Elias-Fano meets Single-Term Top-k Document Retrieval

Labeit

Gog

2017

2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX)

“…To date, there exist several pattern matching indexes for repetitive text collections (see a couple of studies [21,10] and references therein). However, there are not many document retrieval indexes for repetitive text collections [5,8,23]. Most of these indexes [26,8] rely on a pattern-matching index that needs Ω(n) bits in order to offer O(lg n) time per retrieved document. In this paper we introduce new simple and efficient document listing indexes aimed at highly repetitive text collections.…”

mentioning

confidence: 99%

“…However, there are not many document retrieval indexes for repetitive text collections [5,8,23]. Most of these indexes [26,8] rely on a pattern-matching index that needs Ω(n) bits in order to offer O(lg n) time per retrieved document. In this paper we introduce new simple and efficient document listing indexes aimed at highly repetitive text collections. Like various preceding indexes, we achieve O(m + ndoc · lg n) search time, yet our indexes are way faster and/or smaller than previous ones on various repetitive datasets, because they escape from the space/time tradeoff of the pattern-matching index.…”

mentioning

confidence: 99%

Fast, Small, and Simple Document Listing on Repetitive Text Collections

Cobas

Navarro

2019

String Processing and Information Retrieval

Self Cite

Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. In contrast, there are not many analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length n that lists the ndoc distinct documents where a pattern of length m appears in time O(m + ndoc · lg n). We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.

Muthukrishnan [20] designed the first linear-space and optimal-time index for general string collections. Given a collection of total length n, he builds an index of O(n) words that lists the ndoc documents where a pattern of length m appears in time O(m + ndoc). While linear space is deemed as sufficiently small in classic scenarios, the solution is impractical for very large text collections unless one resorts to disk, which is orders of magnitude slower. Sadakane [26] showed how to reduce the space of Muthukrishnan's index to that of the statistically-compressed text plus O(n) bits, while raising the time complexity to only O(m + ndoc · lg n) if the appropriate underlying pattern-matching index is used [2].

The sharp growth of text collections is a concern in many recent applications, outperforming Moore's Law in some cases [27]. Fortunately, many of the fastest-growing text collections are highly repetitive: each document can be obtained from a few large blocks of other documents. These collections arise in different areas, such as repositories of genomes of the same species (which differ from each other by a small percentage only) like the 100K-genome project, software repositories that store all the versions of the code arranged in a tree or acyclic graph like GitHub, versioned document repositories where each document has a timeline of versions like Wikipedia, etc. On such text collections, statistical compression is ineffective [14] and even O(n) bits of extra space can be unaffordable.

Repetitiveness is the key to tackle the fast growth of these collections: their amount of new material grows much slower than their size. For example, version control systems compress those collections by storing the list of edits with respect to some reference document that is stored in plain form, and reconstruct it by applying the edits to the reference version. Much more challenging, however, is to index those ...
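
The document array mentioned in the abstract above (the suffix array coarsened to document identifiers) can be illustrated with a small uncompressed sketch in Python. This is only a toy under stated assumptions, not the grammar-compressed index of the paper: the suffix array is built by naive sorting, the separator `SEP` is assumed not to occur in any document or pattern, and the names `build_index` and `list_documents` are hypothetical.

```python
from bisect import bisect_left, bisect_right

SEP = "\x00"  # assumption: never occurs in a document or in a pattern


def build_index(docs):
    """Build (text, suffix array, document array) for a small collection."""
    text = "".join(doc + SEP for doc in docs)
    # doc_of[p] = identifier of the document that covers text position p
    doc_of = []
    for doc_id, doc in enumerate(docs):
        doc_of.extend([doc_id] * (len(doc) + 1))
    # Naive suffix array: sort all suffix start positions lexicographically.
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    # Document array: the suffix array coarsened to document identifiers.
    da = [doc_of[p] for p in sa]
    return text, sa, da


def list_documents(index, pattern):
    """Return the sorted identifiers of all documents containing pattern."""
    text, sa, da = index
    m = len(pattern)
    # Length-m prefixes of sorted suffixes are themselves in sorted order,
    # so the matching suffixes form one contiguous range [lo, hi).
    prefixes = [text[p:p + m] for p in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Every occurrence sits in da[lo:hi]; deduplicate to list documents.
    return sorted(set(da[lo:hi]))
```

The final step reads every occurrence in the range, so its cost grows with the number of occurrences and not with ndoc. Closing that gap is what the indexes of Muthukrishnan and Sadakane cited in the text address.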

“…Their work, using large query logs, provides new insights into the relative efficiency of selective search compared to exhaustive random sharding, how to distribute those shards across machines, and yields details of trade-offs possible between throughput and latency constraints. Gagie et al. (2017) examine indexing for repetitive collections. Their work includes effective compression techniques, methods for top-k retrieval and identifying the number of documents containing a given string.…”

mentioning

confidence: 99%

Efficiency in information retrieval: introduction to special issue

Hawking

Moffat

Trotman

2017

Inf Retrieval J

The efficiency of information retrieval (IR) algorithms has always been of interest to researchers at the computer science end of the IR field, and index compression techniques, intersection and ranking algorithms, and pruning mechanisms have been a constant feature of IR conferences and journals over many years. Efficiency is also of serious economic concern to operators of commercial web search engines, where a cluster of a thousand or more computers might participate in processing a single query, and where such clusters of machines might be replicated hundreds of times to handle the query load (Dean 2009). In this environment even relatively small improvements in query processing efficiency could potentially save tens of millions of dollars per year in terms of hardware and energy costs, and at the same time significantly reduce greenhouse gas emissions.

In commercial data centres, query processing is by no means the only big IR consumer of server processing cycles. Crawling, indexing, format conversion, PageRank calculation, ranker training, deep learning, knowledge graph generation and processing, social network analysis, query classification, natural language processing, speech processing, question answering, query auto-completion, related search mechanisms, navigation systems and ad targeting are also computationally expensive, and potentially capable of being made more efficient. Data centres running such services are replicated across the world, and their operations provide everyday input to the lives of billions of people. Information retrieval algorithms also run at large scale in cloud-based services and in social media sites such as Facebook and Twitter.

Efficiency in indexing and searching email and documents in a multi-tenant cloud is important, and difficult to achieve. Even so, when the individual enterprise search applications are small in scale, the investment of programmer time to achieve gains in efficiency can soon pay for itself in reduced server hosting costs.
