Storage and Retrieval of Highly Repetitive Sequence Collections

Mäkinen, Veli; Navarro, Gonzalo; Sirén, Jouni; Välimäki, Niko

doi:10.1089/cmb.2009.0169

Cited by 189 publications

(249 citation statements)

References 33 publications

Supporting

Mentioning: 247

Contrasting


“…2 includes operations that are exclusive of suffix trees, and access the other CSA components. The suffix link operation (sLink) requires, in our case, to map nodes to suffix array leaves, compute function Ψ on the RLCSA [20], map back to suffix tree nodes, and compute an LCA. Our GCT and NPR take near 200 µsec to complete this operation, whereas NPR-Repet and FCST use 2-5 msec, an order of magnitude slower.…”

Section: Results (mentioning)

confidence: 99%
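The suffix link computation described in the statement above (map the node to suffix array leaves, apply Ψ, then take an LCA) can be sketched on a toy text. This is a minimal illustration only, under stated assumptions: plain Python lists stand in for the RLCSA and the compressed tree, a suffix tree node is represented by its inclusive suffix array interval, and the helper names `build_index`, `lca`, and `suffix_link` are hypothetical, not taken from any of the cited implementations.

```python
def build_index(text):
    """Return (sa, psi, lcp) as plain lists; uncompressed stand-ins for a CSA.

    Assumes `text` ends with a unique terminator that sorts before every
    other character (for example '$').
    """
    n = len(text)
    # Naive suffix sorting; fine for a toy text, far too slow for real data.
    sa = sorted(range(n), key=lambda i: text[i:])
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    # psi[i] is the rank of the suffix that starts one position after sa[i].
    psi = [rank[(p + 1) % n] for p in sa]
    # lcp[i] is the longest common prefix of the suffixes ranked i-1 and i.
    lcp = [0] * n
    for i in range(1, n):
        a, b = text[sa[i - 1]:], text[sa[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return sa, psi, lcp


def lca(lcp, x, y):
    """Interval of the lowest common ancestor of leaves x <= y."""
    if x == y:
        return (x, x)
    # String depth of the LCA is the minimum LCP value between the leaves.
    h = min(lcp[x + 1:y + 1])
    lo, hi = x, y
    # Grow the interval while neighbouring suffixes still share h characters.
    while lo > 0 and lcp[lo] >= h:
        lo -= 1
    while hi + 1 < len(lcp) and lcp[hi + 1] >= h:
        hi += 1
    return (lo, hi)


def suffix_link(psi, lcp, node):
    """Suffix link of a non-root node given as an inclusive interval (l, r)."""
    l, r = node
    return lca(lcp, psi[l], psi[r])


if __name__ == "__main__":
    sa, psi, lcp = build_index("banana$")
    # Follow suffix links from the node for "ana" up to the root interval.
    node = (2, 3)
    while node != (0, len(sa) - 1):
        print(node)
        node = suffix_link(psi, lcp, node)
    print(node)
```

On "banana$" the node for "ana" is the interval (2, 3); following suffix links visits the nodes for "na" and "a" and then the root. In the compressed setting each array access here becomes a query on a compressed structure, which is where the reported microsecond and millisecond costs arise.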

“…Note that, within this space, CSAs can reproduce any substring of T , so T does not need to be stored separately. Mäkinen et al [20] introduced the run-length CSA, or RLCSA, which compresses better when T is repetitive (i.e., it can be represented as the concatenation of a few different substrings). Statistical compressors do not take proper advantage of repetitiveness [16].…”

Section: Compressed Suffix Trees (mentioning)

confidence: 99%

“…However, they do not provide the versatile suffix tree functionality, and they do not seem to yield a way to obtain it. Instead, the so-called run-length compressed suffix array [20] (run-length CSA or RLCSA), although based in principle on weaker compression techniques, yields a data structure that is useful to achieve CSTs for repetitive collections (because CST implementations always build on a CSA).…”

Section: Introduction (mentioning)

confidence: 99%


Faster Compressed Suffix Trees for Repetitive Text Collections

Navarro, Ordóñez. 2014. Experimental Algorithms. Self-citation.

Abstract. Recent compressed suffix trees targeted to highly repetitive text collections reach excellent compression performance, but operation times in the order of milliseconds. We design a new suffix tree representation for this scenario that still achieves very low space usage, only slightly larger than the best previous one, but supports the operations within microseconds. This puts the data structure in the same performance level of compressed suffix trees designed for standard text collections, which on repetitive collections use many times more space than our new structure.


“…More importantly, repetitions in T induce long runs in Ψ, and hence a smaller r [25]. An exact bound has been elusive, but Mäkinen et al [25] gave an average-case upper bound for r: If T is formed by a random base sequence of length n and then other sequences that have m random mutations (which include indels, replacements, block moves, etc.) with respect to the base sequence, then r is at most n + O(m log_σ n) on average.…”

Section: Re-pair and Repetition-aware CSAs (mentioning)

confidence: 99%
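The effect described in the statement above, where repetitions in T produce long runs in Ψ and hence a small r, can be observed on synthetic data. The following is a rough sketch, not the RLCSA itself: it builds Ψ with a naive suffix sort and counts maximal runs of consecutive values. The function names, the sequence lengths, and the mutation model (substitutions only) are illustrative assumptions.

```python
import random


def psi_array(text):
    """Psi for a text ending in a unique, smallest terminator such as '$'."""
    n = len(text)
    # Naive suffix sorting; adequate for a few thousand characters.
    sa = sorted(range(n), key=lambda i: text[i:])
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    return [rank[(p + 1) % n] for p in sa]


def count_runs(psi):
    """Number of maximal runs in which psi[i + 1] == psi[i] + 1."""
    return 1 + sum(1 for a, b in zip(psi, psi[1:]) if b != a + 1)


def mutated_copies(base, copies, mutations, rng):
    """Concatenate `copies` variants of `base`, each with a few substitutions."""
    parts = []
    for _ in range(copies):
        seq = list(base)
        for _ in range(mutations):
            seq[rng.randrange(len(seq))] = rng.choice("ACGT")
        parts.append("".join(seq))
    return "".join(parts) + "$"


if __name__ == "__main__":
    rng = random.Random(0)
    base = "".join(rng.choice("ACGT") for _ in range(200))
    repetitive = mutated_copies(base, copies=10, mutations=2, rng=rng)
    unrelated = "".join(rng.choice("ACGT") for _ in range(2000)) + "$"
    print("runs, repetitive collection:", count_runs(psi_array(repetitive)))
    print("runs, random text:", count_runs(psi_array(unrelated)))
```

Both texts have the same length, so the gap between the two run counts reflects the repetitiveness of the collection rather than its size.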

“…This space is that of the suffix array sampling, which is related to the speed of computing the contents of suffix array cells (and hence computing LCP values). It is interesting that Mäkinen et al [25] proposed a solution to compress this array that proved impractical for the small databases we are experimenting with, but whose asymptotic properties ensure that [it] will become practical for sufficiently large and repetitive collections. Let us discuss the NPR operations now.…”

Section: Collection (mentioning)

confidence: 99%

Practical Compressed Suffix Trees

Cánovas, Navarro. 2010. Experimental Algorithms. Self-citation.

Abstract: The suffix tree is an extremely important data structure in bioinformatics. Classical implementations require much space, which renders them useless to handle large sequence collections. Recent research has obtained various compressed representations for suffix trees, with widely different space-time tradeoffs. In this paper we show how the use of range min-max trees yields novel representations achieving practical space/time tradeoffs. In addition, we show how those trees can be modified to index highly repetitive collections, obtaining the first compressed suffix tree representation that effectively adapts to that scenario.

Fast in‐memory XPath search using compressed indexes

et al. 2013. Self-citation.

ISI publication article. Extensible Markup Language (XML) documents consist of text data plus structured data (markup). XPath allows querying both text and structure. Evaluating such hybrid queries is challenging. We present a system for in-memory evaluation of XPath search queries, that is, queries with text and structure predicates, yet without advanced features such as backward axes, arithmetic, and joins. We show that for this query fragment, which contains Forward Core XPath, our system, dubbed Succinct XML Self-Index (‘SXSI’), outperforms existing systems by 1–3 orders of magnitude. SXSI is based on state-of-the-art indexes for text and structure data. It combines two novelties. On one hand, it represents the XML data in a compact indexed form, which allows it to handle larger collections in main memory while supporting powerful search and navigation operations over the text and the structure. On the other hand, it features an execution engine that uses tree automata and cleverly chooses evaluation orders that leverage the speeds of the respective indexes. SXSI is modular and allows seamless replacement of its indexes. This is demonstrated through experiments with (1) a text index specialized for search of bio sequences, and (2) a word-based text index specialized for natural language search.

Fondecyt, Chile 1-11006
