Faster Compressed Suffix Trees for Repetitive Text Collections

Navarro, Gonzalo; Ordóñez, Alberto

doi:10.1007/978-3-319-07959-2_36

Cited by 19 publications

(22 citation statements)

References 38 publications


“…An extension of the RLFM-index [80] still needs O(n/s) space to carry out most of the suffix tree operations in time O(s log n). Some variants that are designed for repetitive text collections [1,92] are heuristic and do not offer worst-case guarantees. Only recently a compressed suffix tree was presented [8] that uses O(e) space and carries out operations in O(log n) time.…”

Section: Compressed Suffix Trees (mentioning)

confidence: 99%

“…The first compressed suffix tree for repetitive collections was built on runs [80], but just like the self-index, it needed Θ(n/s) space to obtain O(s log n) time in key operations like accessing SA. Other compressed suffix trees for repetitive collections appeared later [1,92,29], but they do not offer formal space guarantees (see later). A recent one, instead, uses O(e) words and supports a number of operations in time typically O(log n) [8].…”

Section: A Run-length Compressed Suffix Tree (mentioning)

confidence: 99%


Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Gagie

Navarro

Prezza

2020

J. ACM

Self Cite



Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating (only) the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log_w(σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/log σ), we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w).
Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1-2 orders of magnitude.
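The measure r used in the abstract above can be illustrated with a small sketch. It builds the BWT naively by sorting all rotations, which is only practical for toy inputs and is not how the cited indexes are constructed. The function names `bwt` and `count_runs` and the choice of `$` as sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    Assumes the sentinel does not occur in the text and compares smaller
    than every text symbol (true for "$" against ASCII letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive symbols in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

For `"banana"` the transform is `annb$aa`, which has r = 5 runs over 7 symbols. On highly repetitive inputs r tends to be far smaller than n, which is what makes O(r)-space indexes attractive.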


“…An extension of the RLFM-index [65] still needs O(n/s) space to carry out most of the suffix tree operations in time O(s log n). Some variants that are designed for repetitive text collections [1,76] are heuristic and do not offer worst-case guarantees. Only recently a compressed suffix tree was presented [5] that uses O(e) space and carries out operations in O(log n) time.…”

Section: Compressed Suffix Trees (mentioning)

confidence: 99%

Optimal-Time Text Indexing in BWT-runs Bounded Space

Gagie

Navarro

Prezza

2018

Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms

Self Cite



Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time O(m + occ) within O(r log(n/r)) space, on a RAM machine with words of w = Ω(log n) bits. Raising the space to O(r w log_σ(n/r)), we support locate in O(m log(σ)/w + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and efficiently extracts any text substring, with an O(log(n/r)) additive time penalty over the optimum. Preliminary experiments show that our new structure outperforms the alternatives by orders of magnitude in the space/time tradeoff map.
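The counting operation mentioned in the abstract above rests on backward search over the BWT. The following is a minimal, uncompressed sketch of that search, not the Run-Length FM-index itself: rank queries are answered by a linear scan rather than in loglogarithmic time, and nothing is run-length encoded. The names `bwt` and `count_occurrences` and the `$` sentinel are assumptions for illustration.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT via sorted rotations; the sentinel must sort first."""
    s = text + sentinel
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern in the text underlying bwt_str."""
    # C[c] = number of symbols in the text that are strictly smaller than c.
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i]; a linear scan stands in for
        # the fast rank structure a real index would use.
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)  # suffix-array interval [lo, hi)
    for c in reversed(pattern):  # process the pattern right to left
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    print(count_occurrences(bwt("banana"), "ana"))
```

Each pattern symbol costs two rank queries, so the number of search steps is proportional to m regardless of how many occurrences exist. Locating those occurrences needs extra machinery, which is the gap the abstract describes.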


“…Components (2) and (3), which are usually less relevant in terms of space, may become dominant if they are represented without exploiting repetitiveness. For (2), we compare GCT, a tree representation aimed at repetitive topologies [27], with a classical representation (FF [1]). For (3), we will use our new repetition-aware sequence representations, comparing them with the alternative proposed in SXSI (MATRIX, using one compressed bitmap per tag) and a WTH representation.…”

Section: Application: Xpath Queries On Highly Repetitive Collections (mentioning)

confidence: 99%

Grammar Compressed Sequences with Rank/Select Support

Navarro

Ordóñez

2014

String Processing and Information Retrieval

Self Cite


Abstract. Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. In several recent applications, the need to represent highly repetitive sequences arises, where statistical compression is ineffective. We introduce grammar-based representations for repetitive sequences, which use up to 10% of the space needed by representations based on statistical compression, and support direct access and rank/select operations within tens of microseconds.
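To make the three operations named in the abstract above concrete, here is a plain reference sketch of access, rank and select on a sequence. It stores explicit position lists and is not compressed in any way, so it says nothing about the grammar-based representation the paper introduces. The class name `NaiveSequence` and the conventions used (0-based positions, rank over a half-open prefix, 1-based occurrence index for select) are assumptions for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


class NaiveSequence:
    """Uncompressed reference implementation of access/rank/select."""

    def __init__(self, seq):
        self.seq = list(seq)
        self.pos = defaultdict(list)  # symbol -> sorted list of positions
        for i, c in enumerate(self.seq):
            self.pos[c].append(i)

    def access(self, i):
        """Symbol at position i (0-based)."""
        return self.seq[i]

    def rank(self, c, i):
        """Number of occurrences of c in seq[0:i]."""
        return bisect_left(self.pos.get(c, []), i)

    def select(self, c, j):
        """Position of the j-th occurrence of c (j is 1-based)."""
        occ = self.pos.get(c, [])
        if not 1 <= j <= len(occ):
            raise ValueError("no such occurrence")
        return occ[j - 1]


if __name__ == "__main__":
    s = NaiveSequence("abracadabra")
    print(s.access(4), s.rank("a", 4), s.select("a", 3))
```

Under these conventions rank and select are inverse to each other: `s.rank(c, s.select(c, j)) == j - 1` for every valid `j`.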

