CSA++: Fast Pattern Search for Large Alphabets

Gog, Simon; Moffat, Alistair; Petri, Matthias

doi:10.1137/1.9781611974768.6

Cited by 10 publications

(6 citation statements)

References 23 publications


“…Figure 8 shows the average query time for all datasets when using the implementations of Section 5 with ℓ = |𝑃|. For all datasets and |𝑃| ≥ 64, BDA-index I and II are up to several orders of magnitude faster than the compressed indexes, especially for large alphabets, which is consistent with the observations made in [29,40]. Notably, for all datasets and ℓ values, BDA-index I and II are even faster than the SA.…”

Section: Query Time (supporting)

confidence: 77%

Text Indexing for Long Patterns: Anchors are All you Need

Ayad, Loukides, Pissis (2023), Proc. VLDB Endow.


In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound l on the length of the queried patterns --- which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.
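As a point of reference for the text indexing problem described in this abstract, the following is a minimal sketch of pattern matching over a plain suffix array, one of the classic indexes the abstract names. It is not the bd-anchor index of the cited paper; the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is chosen for brevity rather than for the construction time or space the abstract is concerned with.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the starting positions of all suffixes in lexicographic order.

    Naive construction that sorts the suffixes directly; fine for a small
    illustration, far too slow for real datasets.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all starting positions of `pattern` in `text`, in increasing order.

    Two binary searches locate the contiguous block of suffixes whose
    first len(pattern) characters equal the pattern.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))
```

Each query costs O(|P| log n) character comparisons here, and the index stores one integer per text position, which illustrates the tension between index space and query time that the abstract describes.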


“…Additionally we use a word parsing of the TREC gov2 collection [7]. Table 1 tion and benchmarks are publicly available 5 and contain all parameters left out here due to space constraints.…”

Section: Methods (mentioning)

confidence: 99%

“…To be more specific, we use uncompressed (bit_vector) and compressed (rrr_vector) bit vectors for the wavelet tree of the character based CSA. For word-based indexes we use a recently presented CSA designed for large alphabets [5].…”

Section: Methods (mentioning)

confidence: 99%

Elias-Fano meets Single-Term Top-k Document Retrieval

Labeit, Gog (2017), 2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX)

Self Cite


A fundamental problem in Information Retrieval is to determine the k most relevant documents of a collection for a given query word or phrase P. In a recent result, Navarro and Nekrich [SODA 2012] showed that this problem can be solved in optimal time complexity of O(|P| + k) with a precomputed linear-space index. The size of this optimal-time index was estimated to be 80 times the collection size, rendering it impractical. In subsequent work, Navarro and Konow [DCC 2013] and Gog and Navarro [ALENEX 2015] created a practical version with slightly worse query time guarantees but reduced the space to 2.5 to 3 times the collection size. The index is conceptually simple and is divided into five components. In this paper we show how the n log N bits required by the usually largest component, the so-called repetition array, can be reduced to n log log n + O(n), where n is the size of the collection and N the number of documents. As the overall query time complexity matches the one of the old index, we achieve a theoretically superior time-space trade-off. We explore the practical properties of the improved index in a detailed experimental study and compare to the previously established baseline. Index sizes are now between 1.5 and 2 times the collection size while query speed is comparable to the larger indexes. We also show that the new approach automatically adapts to highly repetitive text collections, which are for instance produced by version control systems.


“…FM-GMR [20] and FM-AP-HYB [21] are FM-index variants that are tailored for huge σ and that support O(log log σ) rank operation (faster than the O(log σ) of UFMI); they are available in sdsl-lite library. These were the fastest (FM-GMR) and the smallest (FM-AP-HYB) methods for huge σ in a recent benchmark [22].…”

Section: Comparison Of Rml With Mel (mentioning)

confidence: 99%

CiNCT: Compression and Retrieval for Massive Vehicular Trajectories via Relative Movement Labeling

Koide, Tadokoro, Xiao, et al. (2018), 2018 IEEE 34th International Conference on Data Engineering (ICDE)


In this paper, we present a compressed data structure for moving object trajectories in a road network, which are represented as sequences of road edges. Unlike existing compression methods for trajectories in a network, our method supports pattern matching and decompression from an arbitrary position while retaining a high compressibility with theoretical guarantees. Specifically, our method is based on FM-index, a fast and compact data structure for pattern matching. To enhance the compression, we incorporate the sparsity of road networks into the data structure. In particular, we present the novel concepts of relative movement labeling and PseudoRank, each contributing to significant reductions in data size and query processing time. Our theoretical analysis and experimental studies reveal the advantages of our proposed method as compared to existing trajectory compression methods and FM-index variants.
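To make the role of the rank operation in FM-index pattern matching concrete, here is a toy sketch of backward search over a Burrows-Wheeler transform. It is an illustration under simplifying assumptions, not the CiNCT structure or any sdsl-lite variant: the function names `bwt_from_text` and `count_occurrences` are hypothetical, the sentinel is assumed not to occur in the text, and `rank` is a linear scan rather than the O(log σ) or O(log log σ) structures mentioned in the citation statement above.

```python
from collections import Counter


def bwt_from_text(text: str, sentinel: str = "\0") -> str:
    """Return the Burrows-Wheeler transform of text + sentinel.

    Assumes the sentinel does not occur in `text` and sorts before
    every other character. Uses a naive suffix sort.
    """
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character for suffix i is the character preceding it
    # (wrapping around to the sentinel for i == 0).
    return "".join(s[i - 1] for i in sa)


def count_occurrences(bwt: str, pattern: str) -> int:
    """Count occurrences of a non-empty `pattern` via backward search."""
    # c_table[ch] = number of characters in the text smaller than ch.
    counts = Counter(bwt)
    c_table = {}
    total = 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def rank(ch: str, i: int) -> int:
        # Occurrences of ch in bwt[0:i]; a real index answers this
        # without scanning, which is where alphabet size matters.
        return bwt[:i].count(ch)

    lo, hi = 0, len(bwt)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + rank(ch, lo)
        hi = c_table[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt = bwt_from_text("banana")
print(count_occurrences(bwt, "ana"))
```

Each pattern character triggers two rank queries, so the cost of rank directly determines query time; this is why rank structures tailored to large alphabets matter for the variants compared in the citing paper.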
