2010
DOI: 10.1089/cmb.2009.0169

Storage and Retrieval of Highly Repetitive Sequence Collections

Abstract: A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space …

Cited by 189 publications (249 citation statements)
References 33 publications

“…2 includes operations that are exclusive of suffix trees, and access the other CSA components. The suffix link operation (sLink) requires, in our case, to map nodes to suffix array leaves, compute function Ψ on the RLCSA [20], map back to suffix tree nodes, and compute an LCA. Our GCT and NPR take near 200 µsec to complete this operation, whereas NPR-Repet and FCST use 2-5 msec, an order of magnitude slower.…”
Section: Results (mentioning)
confidence: 99%
“…Note that, within this space, CSAs can reproduce any substring of T, so T does not need to be stored separately. Mäkinen et al. [20] introduced the run-length CSA, or RLCSA, which compresses better when T is repetitive (i.e., it can be represented as the concatenation of a few different substrings). Statistical compressors do not take proper advantage of repetitiveness [16].…”
Section: Compressed Suffix Trees (mentioning)
confidence: 99%
“…More importantly, repetitions in T induce long runs in Ψ, and hence a smaller r [25]. An exact bound has been elusive, but Mäkinen et al. [25] gave an average-case upper bound for r: If T is formed by a random base sequence of length n and then other sequences that have m random mutations (which include indels, replacements, block moves, etc.) with respect to the base sequence, then r is at most n + O(m log_σ n) on average.…”
Section: Re-pair and Repetition-aware CSAs (mentioning)
confidence: 99%
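
The relationship between repetitiveness and the number of runs r in Ψ, as described in the statement above, can be observed on a toy collection. The following is a minimal sketch, not the RLCSA implementation: it assumes a naive suffix array construction, point substitutions as the only mutation type, and hypothetical helper names (`psi`, `count_runs`, `mutate`). A run is taken here to be a maximal range where Ψ(i+1) = Ψ(i) + 1.

```python
import random

def suffix_array(text):
    # Naive construction by sorting suffixes; adequate for a small illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def psi(text):
    # Psi[i] is the rank of the suffix that starts one position after SA[i]
    # (wrapping around to position 0 after the last character).
    sa = suffix_array(text)
    n = len(text)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    return [rank[(p + 1) % n] for p in sa]

def count_runs(values):
    # Number of maximal runs in which each value is the previous value plus one.
    if not values:
        return 0
    return 1 + sum(1 for a, b in zip(values, values[1:]) if b != a + 1)

def mutate(seq, m, rng):
    # Apply m point substitutions at distinct random positions (assumed mutation model).
    s = list(seq)
    for pos in rng.sample(range(len(s)), m):
        s[pos] = rng.choice([c for c in "ACGT" if c != s[pos]])
    return "".join(s)

rng = random.Random(0)
base = "".join(rng.choice("ACGT") for _ in range(200))
copies = [base] + [mutate(base, 2, rng) for _ in range(7)]

# Repetitive collection: a base sequence plus lightly mutated copies.
repetitive = "#".join(copies) + "$"
# Control: a random sequence of the same total length.
control = "".join(rng.choice("ACGT") for _ in range(len(repetitive) - 1)) + "$"

r_rep = count_runs(psi(repetitive))
r_rand = count_runs(psi(control))
print("length:", len(repetitive), "runs (repetitive):", r_rep, "runs (random):", r_rand)
```

On such inputs the repetitive collection should yield far fewer runs than the random control of equal length, which is the property a run-length encoding of Ψ exploits.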
“…This space is that of the suffix array sampling, which is related to the speed of computing the contents of suffix array cells (and hence computing LCP values). It is interesting that Mäkinen et al. [25] proposed a solution to compress this array that proved impractical for the small databases we are experimenting with, but whose asymptotic properties ensure that [it] will become practical for sufficiently large and repetitive collections. Let us discuss the NPR operations now.…”
Section: Collection (mentioning)
confidence: 99%
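
To illustrate why the suffix array sampling trades space for the speed of computing suffix array cells, here is a minimal sketch of plain regular sampling as commonly used with Ψ-based CSAs. It is not the compressed sampling scheme the statement attributes to Mäkinen et al.; the function names (`build_index`, `locate`) and the sampling rule (keep text positions that are multiples of `step`) are assumptions made for illustration.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; adequate for a small illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def build_index(text, step):
    # Returns Psi, the sampled suffix array cells, and the full suffix array
    # (the full array is kept only so the result can be checked).
    sa = suffix_array(text)
    n = len(text)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    psi = [rank[(p + 1) % n] for p in sa]
    # Keep SA[r] only when the text position is a multiple of step.
    samples = {r: p for r, p in enumerate(sa) if p % step == 0}
    return psi, samples, sa

def locate(i, psi, samples, n):
    # Recover SA[i] by following Psi until a sampled cell is reached.
    # Each Psi step advances one text position, so SA[i] = SA[j] - k (mod n)
    # where j is the sampled rank reached after k steps.
    k = 0
    while i not in samples:
        i = psi[i]
        k += 1
    return (samples[i] - k) % n

text = "GATTACAGATTACA$"
psi_arr, samples, sa = build_index(text, step=4)
recovered = [locate(i, psi_arr, samples, len(text)) for i in range(len(text))]
print(recovered == sa)
```

A larger `step` stores fewer samples but requires up to `step` applications of Ψ per cell, which is the space/time dependence the statement refers to.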