Document Listing on Repetitive Collections

Gagie, Travis; Karhu, Kalle; Navarro, Gonzalo; Puglisi, Simon J.; Sirén, Jouni

doi:10.1007/978-3-642-38905-4_12

Cited by 16 publications

(25 citation statements)

References 20 publications

“…Although they perform reasonably well in practice, none of the existing structures for document listing on repetitive collections [14,23] offer good worst-case time guarantees combined with worst-case space guarantees that are appropriate for repetitive collections, that is, growing with n+s rather than with N. In this paper we present the first document listing index offering good guarantees in space and time for repetitive collections: our index That is, at the price of being an O(lg D) space factor away from what could be hoped from a grammar-based index, our index offers document listing with useful time bounds per listed document.…”

Section: Our Contributions (mentioning)

confidence: 99%

“…We do not store the lists themselves in various orders, but just succinct range minimum query (RMQ) data structures [19] that allow implementing document listing on ranges of lists [51]. Even those RMQ structures are too large for our purposes, so they are further compressed exploiting the fact that their underlying data has long increasing runs, so the structures are reduced with techniques analogous to those developed for the ILCP data structure [23].…”

Section: Our Contributions (mentioning)

confidence: 99%

“…Another precedent is ILCP [23], where it is shown that an array formed by interleaving the longest common prefix arrays of the documents in the order of the global suffix array, ILCP, has long increasing runs on repetitive collections. Then an index of size bounded by the runs in the suffix array [36] and in the ILCP array performs document listing in time O(search(m) + lookup(N) · ndoc), where search and lookup are the search and lookup time, respectively, of a run-length compressed suffix array [36,24].…”

Section: Related Work (mentioning)

confidence: 99%
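
The ILCP construction described in the Related Work statement above can be illustrated with a small, deliberately naive sketch. It is not the compressed index from the cited papers: it builds plain suffix and LCP arrays by sorting, and it assumes each document is terminated by a "$" character that does not occur in the documents and sorts below every other character. The function names (suffix_array, lcp_array, ilcp) are hypothetical and chosen only for this example.

```python
def suffix_array(s):
    # Naive construction: sort suffix start positions by the suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # lcp[r] = length of the longest common prefix of the suffixes
    # ranked r-1 and r in sa; lcp[0] is 0 by convention.
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = sa[r - 1], sa[r]
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        lcp[r] = k
    return lcp


def ilcp(docs):
    # Assumption: "$" does not occur in any document and is smaller than
    # every document character, so suffixes of one document keep the same
    # relative order in the global suffix array as in their own.
    terminated = [d + "$" for d in docs]
    text = "".join(terminated)

    # doc_of[p] = index of the document that covers text position p.
    doc_of = []
    for j, d in enumerate(terminated):
        doc_of.extend([j] * len(d))

    # Document array: which document each global suffix belongs to.
    da = [doc_of[p] for p in suffix_array(text)]

    # One LCP array per document, computed on that document alone.
    per_doc = [lcp_array(d, suffix_array(d)) for d in terminated]

    # Interleave the per-document LCP values in global suffix-array order.
    used = [0] * len(docs)
    out = []
    for j in da:
        out.append(per_doc[j][used[j]])
        used[j] += 1
    return da, out
```

With two identical documents, ilcp(["aa", "aa"]) yields the document array [1, 0, 1, 0, 1, 0] and the interleaved array [0, 0, 0, 0, 1, 1], a toy instance of the runs that the statement attributes to repetitive collections.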

“…Finally, the RMQ answer is either E[i] or E[k], so we access E twice to compare them. This idea was used by Gagie et al. [23, Sec. 3.2] for runs of equal values, but it works verbatim for runs of nondecreasing values. They show how to store F in ρ lg(t/ρ) + O(ρ) bits so that it solves rank in O(lg lg t) time and select in O(1) time, by augmenting a sparse bitvector representation [49].…”

mentioning

confidence: 99%

Document listing on repetitive collections with guaranteed performance

Navarro

2019

Theoretical Computer Science

Self Cite

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size N over alphabet [1, σ] is composed of D copies of a string of size n, and s edits are applied on ranges of copies. We introduce the first document listing index with size Õ(n + s), precisely O((n lg σ + s lg^2 N) lg D) bits, and with useful worst-case time guarantees: Given a pattern of length m, the index reports the ndoc > 0 strings where it appears in time O(m lg^{1+ε} N · ndoc), for any constant ε > 0 (and tells in time O(m lg N) if ndoc = 0). Our technique is to augment a range data structure that is commonly used on grammar-based indexes, so that instead of retrieving all the pattern occurrences, it computes useful summaries on them. We show that the idea has independent interest: we introduce the first grammar-based index that, on a text T[1, N] with a grammar of size r, uses O(r lg N) bits and counts the number of occurrences of a pattern P[1, m] in time O(m^2 + m lg^{2+ε} r), for any constant ε > 0. We also give the first index using O(z lg(N/z) lg N) bits, where T is parsed by Lempel-Ziv into z phrases, counting occurrences in time O(m lg^{2+ε} N).
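
To make the problem statement in this abstract concrete, here is a minimal baseline sketch of document listing over a sorted list of all suffixes. It only illustrates what a query returns; it is not the index of the paper and has none of its space or time guarantees. The names build_index and list_documents are hypothetical.

```python
from bisect import bisect_left


def build_index(docs):
    # Sort every suffix of every document, remembering its source document.
    # This takes quadratic space and is meant only as a readable baseline.
    pairs = sorted((d[i:], j) for j, d in enumerate(docs) for i in range(len(d)))
    keys = [suffix for suffix, _ in pairs]
    doc_ids = [j for _, j in pairs]
    return keys, doc_ids


def list_documents(index, pattern):
    keys, doc_ids = index
    # Suffixes that start with the pattern form one contiguous range.
    lo = bisect_left(keys, pattern)
    hi = lo
    while hi < len(keys) and keys[hi].startswith(pattern):
        hi += 1
    # Report each document once, however many occurrences it contains.
    return sorted(set(doc_ids[lo:hi]))
```

For example, with index = build_index(["banana", "bandana", "cabana"]), the call list_documents(index, "nan") returns [0] and list_documents(index, "ana") returns [0, 1, 2]. The scan over the range visits every occurrence, which is exactly the cost that document listing indexes try to avoid by spending time per listed document instead.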

“…(6) Indexing a highly repetitive or a highly similar document collection is an active line of research. In recent work, Gagie et al. [2013] propose an efficient document retrieval index suitable for a repetitive collection. An open problem is to extend the result for handling top-k queries.…”

Section: Results (mentioning)

confidence: 99%

Space-Efficient Frameworks for Top-k String Retrieval

Hon

Shah

Thankachan

et al. 2014

J. ACM

The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an arbitrary string (which can be a partial word, a multiword phrase, or more generally any sequence of characters), then word boundaries are no longer relevant and we need a different approach. In string retrieval settings, we are given a set D = {d_1, d_2, d_3, ..., d_D} of D strings with n characters in total taken from an alphabet set Σ = [σ], and the task of the search engine, for a given query pattern P of length p, is to report the "most relevant" strings in D containing P. The query may also consist of two or more patterns. The notion of relevance can be captured by a function score(P, d_r), which indicates how relevant document d_r is to the pattern P. Some example score functions are the frequency of pattern occurrences, proximity between pattern occurrences, or pattern-independent PageRank of the document.

The first formal framework to study such kinds of retrieval problems was given by Muthukrishnan [SODA 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures that use O(n log n) words of space. We study this problem in a somewhat more natural top-k framework. Here, k is a part of the query, and the top k most relevant (highest-scoring) documents are to be reported in sorted order of score. We present the first linear-space framework (i.e., using O(n) words of space) that is capable of handling arbitrary score functions with near-optimal O(p + k log k) query time. The query time can be made optimal O(p + k) if sorted order is not necessary. Further, we derive compact space and succinct space indexes (for some specific score functions). This space compression comes at the cost of higher query time. At last, we extend our framework to handle the case of multiple patterns. Apart from providing a robust framework, our results also improve many earlier results in index space or query time or both.
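
The top-k query defined in this abstract can be sketched by brute force, taking the frequency of pattern occurrences as the score function. This is an assumption-laden illustration of the query semantics only: it scans every document and does not reflect the linear-space framework or its query times. The names count_occurrences and top_k are hypothetical, and breaking ties by lower document index is a choice made for this example.

```python
import heapq


def count_occurrences(text, pattern):
    # Frequency score: number of (possibly overlapping) occurrences.
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k(docs, pattern, k):
    # Score every document, keep only those that contain the pattern.
    scored = []
    for j, d in enumerate(docs):
        c = count_occurrences(d, pattern)
        if c > 0:
            scored.append((c, j))
    # Highest score first; ties go to the lower document index.
    best = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [(j, c) for c, j in best]
```

For instance, top_k(["banana", "bandana", "cabana"], "ana", 1) returns [(0, 2)], because "banana" contains two overlapping occurrences of "ana" while the other two strings contain one each.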
