2015 Data Compression Conference
DOI: 10.1109/dcc.2015.55

Document Counting in Compressed Space

Abstract: We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. In this paper we implement these solutions and explore compressed variants, aiming to reduce data structure size. Our main result is to uncover some unexpected compressibility properties of the fastest known data structure for the problem. By taking advantage of these properties, we can reduce the …
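
To make the problem statement in the abstract concrete, here is a minimal sketch of document counting done by brute force. It is not the paper's data structure, only a linear-scan baseline; the function names and the sample collection are assumptions chosen for illustration. It also contrasts document counting (how many strings contain the pattern) with occurrence counting (how many times the pattern appears overall).

```python
from typing import Sequence


def count_documents(collection: Sequence[str], pattern: str) -> int:
    """Number of strings in the collection that contain the pattern at least once.

    Hypothetical baseline: a linear scan over every string, with no index.
    """
    return sum(1 for doc in collection if pattern in doc)


def count_occurrences(collection: Sequence[str], pattern: str) -> int:
    """Total number of (possibly overlapping) occurrences across all strings."""
    total = 0
    for doc in collection:
        start = doc.find(pattern)
        while start != -1:
            total += 1
            # Advance by one position so overlapping matches are counted too.
            start = doc.find(pattern, start + 1)
    return total


# Illustrative collection (an assumption, not data from the paper).
docs = ["banana", "bandana", "cabana", "apple"]

if __name__ == "__main__":
    print(count_documents(docs, "ana"))
    print(count_occurrences(docs, "ana"))
```

The scan takes time proportional to the total length of the collection for every query, which is the cost that indexed solutions to this problem aim to avoid.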

Cited by 6 publications
(5 citation statements)
References 20 publications
“…We use another array A[0, |V| − 1] to store the number of additional values in each node v_i ∈ V as A[i] = |G.value(v_i)| − 1, and encode it as a bitvector B_A in the same way as array R above. The number of distinct values in range The bitvectors are often highly compressible [12], but GCSA already uses one of the compression schemes implicitly when it prunes the de Bruijn graph.…”
Section: E Suffix Tree Of a Path Graph (mentioning)
confidence: 99%
“…We set R[j] = 0 for any subsequent visits to the same node. Range The bitvectors are often highly compressible [12], but GCSA already uses one of the compression schemes implicitly when it prunes the de Bruijn graph.…”
Section: E Suffix Tree Of a Path Graph (mentioning)
confidence: 99%
“…This article collects our earlier results appearing in CPM 2013 (Gagie et al, 2013), ESA 2014 (Navarro et al, 2014a), and DCC 2015 (Gagie et al, 2015), where we focused on exploiting repetitiveness in different ways to handle different document retrieval problems. Here we present them in a unified form, considering the application of two new techniques (ILCP and PDL) and an existing one (Sadakane, 2007) to the three problems (document listing, topk retrieval, and document counting), and showing how they interact (e.g., the need to use fast document counting to choose the best document listing method).…”
Section: Introduction (mentioning)
confidence: 99%
“…However, although this approach is straightforward and have been used in different applications (e.g. [100,67,5,30,56,52,31,79,29,106]), it has some drawbacks, which may deteriorate both the theoretical bounds and the practical behavior of many suffix sorting algorithms.…”
Section: Motivation (mentioning)
confidence: 99%
“…Although both approaches are straightforward and have been used in different applications (e.g. [100,67,5,30,56,52,31,79,29,106]), they have some drawbacks. The first alternative increases the alphabet size of T cat by the number of strings, which may deteriorate the theoretical bounds of many algorithms.…”
Section: Introduction (mentioning)
confidence: 99%