Faster Compact Top-k Document Retrieval

Konow, Roberto; Navarro, Gonzalo

doi:10.1109/dcc.2013.43

Cited by 23 publications

(34 citation statements)

References 26 publications

“…We use various document collections, following previous work [13,7] and exploring different aspects of statistical compressibility, size, number of documents, and repetitiveness: ClueWiki (English, few large documents), DNA (synthetic, mildly repetitive with 5% mutations among documents), KGS (Go game records), Wiki (more and shorter documents), Proteins (many more documents, almost incompressible), and TodoCL (a snapshot of the Chilean Web, with real queries, used to measure quality). Table 1 shows their main characteristics (column "compress" shows how the LZ78-based Unix Compress program compresses them).…”

Section: Results (mentioning)

confidence: 99%

“…Navarro and Nekrich [12] reduced the time to the optimal O(m + k). Konow and Navarro [7] implemented this index, obtaining an index that uses 20-30 bits per character (bpc) and answers top-k queries in k to 4k microseconds (µsec). Their time complexity is O(m + (k + log log n) log log n) with high probability, on statistically typical texts [15].…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Compressed Indexing for Approximate Top-k String Retrieval

Ferrada

Navarro

2014

String Processing and Information Retrieval

Abstract. Given a collection of strings (called documents), the top-k document retrieval problem is that of, given a string pattern p, finding the k documents where p appears most often. This is a basic task in most information retrieval scenarios. The best current implementations require 20-30 bits per character (bpc) and k to 4k microseconds per query, or 12-24 bpc and 1-10 milliseconds per query. We introduce a Lempel-Ziv compressed data structure that occupies 5-10 bpc to answer queries in around k microseconds. The drawback is that the answer is approximate, but we show that its quality improves asymptotically with the size of the collection, being over 85% already for patterns of length 4-6 on rather small collections, and improving for larger ones.
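
The problem statement in the abstract above can be made concrete with a brute-force baseline. The following is a minimal sketch, not the Lempel-Ziv index the abstract describes: it scans every document, counts pattern occurrences (overlapping ones included), and keeps the k most frequent. The function names, the tie-breaking rule (smaller document id first), and the sample documents are assumptions made for illustration.

```python
import heapq
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        return 0
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        # Advance by one position so overlapping matches are not skipped.
        pos = text.find(pattern, pos + 1)
    return count


def top_k_documents(docs: List[str], pattern: str, k: int) -> List[Tuple[int, int]]:
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Documents where the pattern never occurs are skipped. Ties are broken
    by the smaller document id, which is an arbitrary illustrative choice.
    """
    scored = [(count_occurrences(d, pattern), i) for i, d in enumerate(docs)]
    scored = [(f, i) for f, i in scored if f > 0]
    best = heapq.nsmallest(k, scored, key=lambda fi: (-fi[0], fi[1]))
    return [(i, f) for f, i in best]


# Hypothetical sample collection, used only to exercise the functions.
docs = ["abracadabra", "banana", "cabana", "xyz"]
print(top_k_documents(docs, "a", 2))
```

This baseline takes time proportional to the total collection size on every query, which is the cost that the indexes discussed on this page are designed to avoid.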

“…Konow and Navarro [21] achieved O(p + (lg lg n)^2 + k lg lg n) time within |CSA| + (n lg D + 4n lg lg n)(1 + o(1)) bits, but the result holds only almost surely on typical texts, not in the worst case. Their index, on the other hand, turns out to be very competitive in practice.…”

Section: Source (mentioning)

confidence: 99%

“…For the first part, we can use a data structure [21], which is built in O(n lg n) time and answers a top-k query in O((k + lg n) lg n) time once the locus of the pattern is given. Adding the O(k lg n) times over all the O(n/c) sampled nodes, for c = k lg^2 k lg^ε n, gives O(n lg^(1−ε) n / lg^2 k), which added over all the powers of 2 for k gives O(n lg^(1−ε) n).…”

Section: Construction (mentioning)

confidence: 99%
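
The arithmetic behind the construction bound quoted above can be spelled out as follows. This is a sketch under one stated assumption that the quote does not make explicit: k ranges over powers of two k = 2^i with i ≥ 1, so that lg k = i is positive.

```latex
% Assumption: k = 2^i with i >= 1, so lg k = i > 0.
% First line: O(n/c) sampled nodes, each costing O(k lg n), with c = k lg^2 k lg^eps n.
% Second line: sum over all powers of two; the series sum 1/i^2 converges.
\begin{align*}
\frac{n}{c}\cdot k\lg n
  &= \frac{n\,k\lg n}{k\,\lg^{2}k\,\lg^{\varepsilon}n}
   = \frac{n\,\lg^{1-\varepsilon}n}{\lg^{2}k},\\
\sum_{i\ge 1}\frac{n\,\lg^{1-\varepsilon}n}{i^{2}}
  &\le \frac{\pi^{2}}{6}\,n\,\lg^{1-\varepsilon}n
   = O\!\left(n\,\lg^{1-\varepsilon}n\right).
\end{align*}
```

The lg^2 k factor in c is what makes the sum over all values of k converge to a constant multiple of the single-k cost.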

New space/time tradeoffs for top-k document retrieval on sequences

Navarro

Thankachan

2014

Theoretical Computer Science

We address the problem of indexing a collection D = {T_1, T_2, ..., T_D} of D string documents of total length n, so that we can efficiently answer top-k queries: retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O(n) words, that answer such queries in optimal O(p + k) time for an ample set of notions of relevance. However, using linear space is not sufficiently good for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (number of times P appears in a document T_i), an index occupying |CSA| + o(n) bits answers the query in time O(t_search(p) + k lg^2 k lg^ε n), where CSA is a compressed suffix array indexing D, t_search is its time to find the suffix array interval of P, and ε > 0 is any constant. (2) With the same measure of relevance, an index occupying |CSA| + n lg D + o(n lg σ + n lg D) bits answers the query in time O(t_search(p) + k lg* k), where lg* k is the iterated logarithm of k. (3) When the relevance depends only on the documents, an index occupying |CSA| + O(n lg lg n) bits answers the query in O(t_search(p) + k t_SA) time, where t_SA is the time the CSA needs to retrieve a suffix array cell. On our way, we obtain some other results of independent interest.
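
To illustrate how a suffix array interval relates to term frequency, here is a toy sketch that uses a plain, uncompressed suffix array and a document array in place of the compressed structures the abstract describes. Everything named here (`SEP`, `build_index`, `sa_interval`, `top_k`, the sample documents) is hypothetical, and the naive suffix sorting is only suitable for tiny inputs.

```python
from collections import Counter
from typing import List, Tuple

SEP = "\x00"  # hypothetical separator, assumed absent from documents and patterns


def build_index(docs: List[str]) -> Tuple[str, List[int], List[int]]:
    """Concatenate the documents; build a plain suffix array and a document array."""
    text = "".join(d + SEP for d in docs)
    # owner[i] = id of the document that contains text position i
    owner: List[int] = []
    for doc_id, d in enumerate(docs):
        owner.extend([doc_id] * (len(d) + 1))
    # Naive construction by sorting all suffixes (toy sizes only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    da = [owner[i] for i in sa]
    return text, sa, da


def sa_interval(text: str, sa: List[int], pattern: str) -> Tuple[int, int]:
    """Return the half-open range [lo, hi) of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def top_k(docs: List[str], pattern: str, k: int) -> List[Tuple[int, int]]:
    """Return up to k (doc_id, term frequency) pairs, most frequent first."""
    if not pattern:
        return []
    text, sa, da = build_index(docs)  # a real index would be built once and reused
    lo, hi = sa_interval(text, sa, pattern)
    # Each entry of da[lo:hi] is one occurrence of the pattern in that document.
    freq = Counter(da[lo:hi])
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


docs = ["abracadabra", "banana", "cabana", "xyz"]
print(top_k(docs, "a", 2))
```

Counting every entry of the interval takes time proportional to the number of occurrences, whereas the bounds quoted in the abstract depend on k rather than on the interval length; closing that gap within compressed space is the subject of the cited work.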

From Theory to Practice: Plug and Play with Succinct Data Structures

Gog

Beller

Moffat

et al. 2014

Lecture Notes in Computer Science

Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to-use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is a difficult task, since older baseline implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval.
