2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX) 2017
DOI: 10.1137/1.9781611974768.6

CSA++: Fast Pattern Search for Large Alphabets

Abstract: Indexed pattern search in text has been studied for many decades. For small alphabets, the FM-Index provides unmatched performance for Count operations, in terms of both space required and search speed. For large alphabets (for example, when the tokens are words) the situation is more complex, and FM-Index representations are compact, but potentially slow. In this paper we apply recent innovations from the field of inverted indexing and document retrieval to compressed pattern search, including for alphabets i…
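As a rough illustration of the Count operation the abstract refers to, the following is a minimal sketch of backward search over a toy FM-index. It is not the CSA++ structure from the paper: the function names (`build_index`, `count`) are hypothetical, the suffix array is built naively, and rank is answered by binary search over per-symbol position lists rather than by a wavelet tree or any other compressed structure. Tokens may be characters or words, which is the large-alphabet case the abstract mentions.

```python
from bisect import bisect_left


def build_index(tokens):
    """Build a toy FM-index over a sequence of tokens (characters or words).

    Illustrative only: the suffix array is built by naive sorting, which is
    far too slow and memory-hungry for real collections.
    """
    alphabet = sorted(set(tokens))
    # Map each token to an integer code; code 0 is reserved for the sentinel.
    code = {tok: i + 1 for i, tok in enumerate(alphabet)}
    seq = [code[t] for t in tokens] + [0]
    n = len(seq)

    # Naive suffix array and the Burrows-Wheeler transform derived from it.
    sa = sorted(range(n), key=lambda i: seq[i:])
    bwt = [seq[i - 1] for i in sa]  # seq[-1] is the sentinel when i == 0

    # C[c] = number of symbols in seq that are strictly smaller than c.
    C = [0] * (len(alphabet) + 2)
    for c in seq:
        C[c + 1] += 1
    for c in range(1, len(C)):
        C[c] += C[c - 1]

    # occ[c] = sorted positions of symbol c in the BWT, used to answer rank.
    occ = [[] for _ in range(len(alphabet) + 1)]
    for pos, c in enumerate(bwt):
        occ[c].append(pos)

    return code, C, occ, n


def count(index, pattern):
    """Return the number of occurrences of pattern using backward search."""
    code, C, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix array rows
    for tok in reversed(pattern):
        c = code.get(tok)
        if c is None:
            return 0  # token never appears in the indexed text
        # rank(c, i) = occurrences of c in bwt[0:i], via binary search.
        lo = C[c] + bisect_left(occ[c], lo)
        hi = C[c] + bisect_left(occ[c], hi)
        if lo >= hi:
            return 0
    return hi - lo


char_index = build_index("abracadabra")
word_index = build_index("to be or not to be".split())
print(count(char_index, "abra"))
print(count(word_index, ["to", "be"]))
```

Each pattern symbol costs two rank queries, so the speed of rank over the alphabet dominates query time; that is why the alphabet size matters for Count.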

Cited by 10 publications (6 citation statements)
References 23 publications
“…Figure 8 shows the average query time for all datasets when using the implementations of Section 5 with ℓ = |𝑃 |. For all datasets and |𝑃 | ≥ 64, BDA-index I and II are up to several orders of magnitude faster than the compressed indexes, especially for large alphabets, which is consistent with the observations made in [29,40]. Notably, for all datasets and ℓ values, BDA-index I and II are even faster than the SA.…”
Section: Query Time
Citation type: supporting
confidence: 77%
“…Additionally we use a word parsing of the TREC gov2 collection [7]. Table 1 tion and benchmarks are publicly available 5 and contain all parameters left out here due to space constrains.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…To be more specific, we use uncompressed (bit vector) and compressed (rrr vector) bit vectors for the wavelet tree of the character based CSA. For word-based indexes we use a recently presented CSA designed for large alphabets [5].…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…FM-GMR [20] and FM-AP-HYB [21] are FM-index variants that are tailored for huge σ and that support O(log log σ) rank operation (faster than the O(log σ) of UFMI); they are available in sdsl-lite library. These were the fastest (FM-GMR) and the smallest (FM-AP-HYB) methods for huge σ in a recent benchmark [22].…”
Section: Comparison Of Rml With Mel
Citation type: mentioning
confidence: 99%