2013 Data Compression Conference
DOI: 10.1109/dcc.2013.43

Faster Compact Top-k Document Retrieval

Abstract: An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA'12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n-3n bytes, with O(m + (k + log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structure…
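
To make the problem statement concrete, the following is a minimal brute-force sketch of what a top-k document retrieval query returns. It is not the index described in the abstract: it scans every document on each query, whereas the paper's point is to avoid exactly that. The relevance measure (number of occurrences of the pattern in the document) is an assumption, since the abstract does not name one, and the function names `count_occurrences` and `naive_top_k` are hypothetical.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def naive_top_k(documents: list[str], pattern: str, k: int) -> list[tuple[int, int]]:
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Assumption: relevance is the occurrence count of the pattern.
    Ties are broken by the smaller document id.
    """
    scored = []
    for doc_id, doc in enumerate(documents):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((doc_id, freq))
    # Select the k best pairs without fully sorting the candidate list.
    return heapq.nsmallest(k, scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "cab", "xyz"]
    print(naive_top_k(docs, "a", 2))
```

This baseline costs time proportional to the total collection size per query, which is the behaviour that the O(m + k) and O(m + (k + log log n) log log n) bounds quoted above improve on.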

Cited by 23 publications (34 citation statements)
References 26 publications
“…We use various document collections, following previous work [13,7] and exploring different aspects of statistical compressibility, size, number of documents, and repetitiveness: ClueWiki (English, few large documents), DNA (synthetic, mildly repetitive with 5% mutations among documents), KGS (Go game records), Wiki (more and shorter documents), Proteins (many more documents, almost incompressible), and TodoCL (a snapshot of the Chilean Web, with real queries, used to measure quality). Table 1 shows their main characteristics (column "compress" shows how the LZ78-based Unix Compress program compresses them).…”
Section: Results (mentioning)
confidence: 99%
“…Navarro and Nekrich [12] reduced the time to the optimal O(m + k). Konow and Navarro [7] implemented this index, obtaining an index that uses 20-30 bits per character (bpc) and answers top-k queries in k to 4k microseconds (µsec). Their time complexity is O(m + (k + log log n) log log n) with high probability, on statistically typical texts [15].…”
Section: Introduction (mentioning)
confidence: 99%
“…Konow and Navarro [21] achieved O(p + (lg lg n)^2 + k lg lg n) time within |CSA| + (n lg D + 4n lg lg n)(1 + o(1)) bits, but the result holds only almost surely on typical texts, not in the worst case. Their index, on the other hand, turns out to be very competitive in practice.…”
Section: Source (mentioning)
confidence: 99%
“…For the first part, we can use a data structure [21], which is built in O(n lg n) time and answers a top-k query in O((k + lg n) lg n) time once the locus of the pattern is given. Adding the O(k lg n) times over all the O(n/c) sampled nodes, for c = k lg^2 k lg^ε n, gives O(n lg^(1−ε) n / lg^2 k), which added over all the powers of 2 for k gives O(n lg^(1−ε) n).…”
Section: Construction (mentioning)
confidence: 99%
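
The summation in the construction statement quoted above can be spelled out as follows. This is a reconstruction from the quoted bounds only, not the citing paper's own derivation. It assumes k ranges over the powers of two 2^i with i ≥ 1 (so that lg k is nonzero) and uses the fact that the series of 1/i^2 converges.

```latex
% Cost for one value of k: (number of sampled nodes) x (time per node),
% with sampling step c = k \lg^{2} k \, \lg^{\varepsilon} n.
\[
  \frac{n}{c} \cdot O(k \lg n)
  = O\!\left( \frac{n \, k \lg n}{k \lg^{2} k \, \lg^{\varepsilon} n} \right)
  = O\!\left( \frac{n \lg^{1-\varepsilon} n}{\lg^{2} k} \right)
\]

% Summing over k = 2^{i}, where \lg^{2} k = i^{2}:
\[
  \sum_{i \ge 1} O\!\left( \frac{n \lg^{1-\varepsilon} n}{i^{2}} \right)
  = O\!\left( n \lg^{1-\varepsilon} n \sum_{i \ge 1} \frac{1}{i^{2}} \right)
  = O\!\left( n \lg^{1-\varepsilon} n \right)
\]
```

The lg^2 k factor in c is what makes the per-k costs form a convergent series, so the total stays within the bound for a single value of k up to a constant factor.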