A Faster Grammar-Based Self-index

Gagie, Travis; Gawrychowski, Paweł; Kärkkäinen, Juha; Nekrich, Yakov

doi:10.1007/978-3-642-28332-1_21

Cited by 81 publications

(64 citation statements)

References 42 publications

“…Many proposals since then aimed at reducing the locating time by building on other compression methods that perform well on repetitive texts: indexes based on the Lempel-Ziv parse [76] of T, with size bounded in terms of the number z of phrases [73,42,97,9,88,15,23]; indexes based on the smallest context-free grammar (or an approximation thereof) that generates T and only T [68,21], with size bounded in terms of the size g of the grammar [25,26,41,89]; and indexes based on the size e of the smallest automaton (CDAWG) [18] recognizing the substrings of T [9,111,7]. Table 1 summarizes the Pareto-optimal achievements.…”

Section: Related Work (mentioning)

confidence: 99%

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Gagie, Navarro, Prezza. 2020. J. ACM.

Indexing highly repetitive texts (such as genomic databases, software repositories and versioned text collections) has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating (only) the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log_w(σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/log σ), we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1-2 orders of magnitude.
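
To make the measure r concrete, the following is a minimal sketch (not taken from the paper) that builds the BWT of a short string by sorting its rotations and counts the runs of equal symbols. The function names `bwt` and `count_runs` and the `$` sentinel are assumptions chosen for illustration; sorting rotations takes quadratic space and is only suitable for toy inputs.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (illustration only)."""
    # The sentinel must not occur in the text and must sort before all symbols.
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive symbols (the measure r)."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


if __name__ == "__main__":
    for t in ["banana", "ab" * 16]:
        b = bwt(t)
        print(t, b, count_runs(b))
```

On a highly repetitive input such as `"ab" * 16`, the BWT groups equal symbols together, so the run count stays small even as the text grows.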

“…There have been some indexes aimed at performing pattern matching on repetitive collections based on those techniques [17,16,8,10,13]. However, they do not provide the versatile suffix tree functionality, and they do not seem to yield a way to obtain it.…”

Section: Introduction (mentioning)

confidence: 99%

Faster Compressed Suffix Trees for Repetitive Text Collections

Navarro, Ordóñez. 2014. Experimental Algorithms.

Abstract. Recent compressed suffix trees targeted to highly repetitive text collections reach excellent compression performance, but operation times in the order of milliseconds. We design a new suffix tree representation for this scenario that still achieves very low space usage, only slightly larger than the best previous one, but supports the operations within microseconds. This puts the data structure in the same performance level of compressed suffix trees designed for standard text collections, which on repetitive collections use many times more space than our new structure.

“…For example, compressed pattern matching [33], grammar-based self-index [34,35], random accessible data structure [36] and so on. One property of our grammar is that the height of the parse tree is bounded by O(log n); another property is that our algorithm can find long common substrings without Ω(n) space data structures.…”

Section: Discussion (mentioning)

confidence: 99%

An Online Algorithm for Lightweight Grammar-Based Compression

Maruyama, Takeda, Nakahara, et al. 2011. First International Conference on Data Compression, Communications and Processing.

Grammar-based compression is a well-studied technique to construct a context-free grammar (CFG) deriving a given text uniquely. In this work, we propose an online algorithm for grammar-based compression. Our algorithm guarantees O(log^2 n)-approximation ratio for the minimum grammar size, where n is an input size, and it runs in input linear time and output linear space. In addition, we propose a practical encoding, which transforms a restricted CFG into a more compact representation. Experimental results by comparison with standard compressors demonstrate that our algorithm is especially effective for highly repetitive text.
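
As a rough illustration of what "a CFG deriving a given text uniquely" means, here is a small sketch that repeatedly replaces the most frequent adjacent pair with a fresh nonterminal, in the style of Re-Pair. This is a swapped-in offline heuristic, not the online algorithm the abstract describes, and it carries no approximation guarantee. The names `repair`, `expand`, and the `X0, X1, ...` nonterminals are hypothetical.

```python
from collections import Counter


def repair(text: str):
    """Greedy pair replacement (Re-Pair style); returns (start sequence, rules)."""
    seq = list(text)   # terminals are single characters
    rules = {}         # nonterminal name -> (left symbol, right symbol)
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:   # no adjacent pair repeats: stop
            break
        name = f"X{next_id}"   # multi-character, so it never clashes with a terminal
        next_id += 1
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):    # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol: str, rules: dict) -> str:
    """Derive the unique string generated by a symbol."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


if __name__ == "__main__":
    text = "ab" * 8
    seq, rules = repair(text)
    size = len(seq) + 2 * len(rules)   # grammar size: total right-hand-side length
    print(seq, rules, size)
    print("".join(expand(s, rules) for s in seq) == text)
```

The grammar size computed here (total length of all right-hand sides) corresponds to the quantity g used in the grammar-based index bounds quoted earlier on this page.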
