Replacing suffix trees with enhanced suffix arrays

Abouelhoda, Mohamed; Kurtz, Stefan; Ohlebusch, Enno

doi:10.1016/s1570-8667(03)00065-0

Cited by 572 publications

(630 citation statements)

References 10 publications

Supporting

Mentioning: 600

Contrasting

Unclassified


“…We also compared to the enhanced suffix array ESA [Abouelhoda et al 2004]; we used the implementation that is plugged into the Vmatch software package. For suffix array construction, we used the bpr algorithm [Schürmann and Stoye 2005] that is currently the fastest construction algorithm in practice.…”

Section: Results (mentioning)

confidence: 99%

Engineering a Compressed Suffix Tree Implementation

Välimäki, Gerlach, Dixit, et al. 2007

Experimental Algorithms


Suffix tree is one of the most important data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, often the main memory size becomes the bottleneck. This is easily explained by the fact that while a DNA sequence of length n from alphabet Σ = {A, C, G, T} can be stored in n log |Σ| = 2n bits, its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches factor 50.

We report on an implementation of the compressed suffix tree very recently proposed by Sadakane (Theory of Computing Systems, in press). The compressed suffix tree occupies space proportional to the text size, i.e. O(n log |Σ|) bits, and supports all typical suffix tree operations with at most log n factor slowdown. Our experiments show that, e.g. on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of the normal suffix tree. At the same time, a representative algorithm is slowed down by factor 30.

Our implementation follows the original proposal in spirit, but some internal parts are tailored towards practical implementation. Our construction algorithm has time requirement O(n log n log |Σ|) and uses closely the same space as the final structure while constructing it: on the 10 MB DNA sequence, the maximum space usage during construction is only 1.5 times the final product size. As by-products, we develop a method to create Succinct Suffix Array directly from Burrows-Wheeler transform and a space-efficient version of suffixes-insertion algorithm to build balanced parentheses representation of suffix tree from LCP information.



“…They are extensively used not only for full-text searching, but also for more complex pattern matching and discovery problems. It has been shown that a suffix array can support the same functionality as a suffix tree, provided that extra data is stored [1], but it needs no extra data to support simple pattern searches. In practice, suffix arrays are in many cases preferred over suffix trees, because of their smaller space requirement and better locality of access.…”

Section: The Importance Of Suffix Arrays In Text Processing and Searc (mentioning)

confidence: 99%
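The statement above notes that a plain suffix array needs no extra data for simple pattern searches. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the method of any cited paper: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction is a naive sort rather than a practical algorithm such as bpr.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text.

    This costs O(n^2 log n) in the worst case and is for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of `pattern` using two binary searches.

    Only the text and the suffix array are consulted (no LCP or child
    tables), so each comparison costs up to len(pattern) characters.
    """
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Suffixes in sa[start:lo] all begin with the pattern.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "ana"))
```

The search takes O(m log n) character comparisons for a pattern of length m. The extra tables of an enhanced suffix array are what allow richer suffix-tree-style traversals on top of this basic lookup.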

“…We have illustrated this in Figure 2(e), where we show (in a separate array below each local suffix array) the global position of the corresponding local position. Thus, for instance, we have that SA_3[1] corresponds to the global entry SA[1], as indicated by 1. An important difference with the Multiplexed strategy is that, for the GLocal approach, the sampling of the global array is not uniform, as it can be seen in Figure 3(b).…”

Section: Global Suffix Array With Local Text (mentioning)

confidence: 99%

Distributed text search using suffix arrays

Arroyuelo, Bonacic, Gil-Costa, et al. 2014

Parallel Computing


Text search is a classical problem in Computer Science, with many data-intensive applications. For this problem, suffix arrays are among the most widely known and used data structures, enabling fast searches for phrases, terms, substrings and regular expressions in large texts. Potential application domains for these operations include large-scale search services, such as Web search engines, where it is necessary to efficiently process intensive-traffic streams of on-line queries. This paper proposes strategies to enable such services by means of suffix arrays. We introduce techniques for deploying suffix arrays on clusters of distributed-memory processors and then study the processing of multiple queries on the distributed data structure. Even though the cost of individual search operations in sequential (non-distributed) suffix arrays is low in practice, the problem of processing multiple queries on distributed-memory systems, so that hardware resources are used efficiently, is relevant to services aimed at achieving high query throughput at low operational costs. Our theoretical and experimental performance studies show that our proposals are suitable solutions for building efficient and scalable on-line search services based on suffix arrays.

Introduction

In the last decade, the design of efficient data structures and algorithms for textual databases and related applications has received a great deal of attention, due to the rapid growth of the amount of text data available from different sources. Typical applications support text searches over big text collections in a client-server fashion, where the user queries are answered by a dedicated server [15]. The server efficiency, in terms of running time, is of paramount importance in cases where the services demanded by clients generate a heavy work load. A feasible way to overcome the limitations of sequential computers is to resort to the use of several computers, or processors, which work together to serve the ever increasing client demands [19]. One such approach to efficient parallelization is to distribute the data onto the processors, in such a way that it becomes feasible to exploit locality via parallel processing of user requests, each on a subset of the data. As opposed to shared-memory models, this distributed-memory model provides the benefit of better [...] in distributed memory systems, and describes strategies to reduce the inter-processor communication and to improve the load balance at search time.

Indexed Text Searching

The advent of powerful processors and cheap storage has enabled alternative models for information retrieval, other than the traditional one of a collection of documents indexed by a fixed set of keywords. One is the full text model, in which the user expresses its information need via words, phrases or patterns to be matched for, and the information system retrieves those documents containing the user-specified pa...


“…In bioinformatics and text mining applications suffix arrays with some further annotations are often used as an indexing structure for fast string querying [1], and also in the data compression community suffix arrays received more and more attention over the last decade. At first, this interest has arisen from the close relation with the Burrows-Wheeler-Transform [4] which is mainly based on the fact that computing the Burrows-Wheeler-Transform by block-sorting the input string is equivalent to suffix array construction.…”

Section: Introduction (mentioning)

confidence: 99%
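The equivalence mentioned in the statement above, between block-sorting and suffix array construction, can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not in the source: the text is terminated by a unique sentinel `$` that sorts before every other character, and the helper names `bwt_from_suffix_array` and `bwt_by_block_sorting` are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: sort start positions by suffix (illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text):
    """BWT read off the suffix array: the character preceding each sorted suffix.

    Assumes `text` ends with a unique sentinel that is smaller than every
    other character. For the suffix starting at position 0, index -1 wraps
    around to the sentinel itself.
    """
    return "".join(text[i - 1] for i in suffix_array(text))


def bwt_by_block_sorting(text):
    """BWT by sorting all cyclic rotations and taking the last column."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


text = "banana$"
print(bwt_from_suffix_array(text))
print(bwt_by_block_sorting(text))
```

With the sentinel in place, sorting rotations and sorting suffixes produce the same order, so both functions return the same string. Without a unique terminator the two orders can differ, which is why the assumption is stated explicitly.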

“…In order to define the other two equivalences, we first introduce a bijective mapping m of the characters of a string t to the first |Σ(t)| integers, m : Σ(t) → [1, |Σ(t)|] such that m(t) = m(t[1]) m(t[2]) . .…”

Section: Introduction (mentioning)

confidence: 99%

Counting Suffix Arrays and Strings

Schürmann, Stoye 2005

String Processing and Information Retrieval


Suffix arrays are used in various application and research areas like data compression or computational biology. In this work, our goal is to characterize the combinatorial properties of suffix arrays and their enumeration. For fixed alphabet size and string length we count the number of strings sharing the same suffix array and the number of such suffix arrays. Our methods have applications to succinct suffix arrays and build the foundation for the efficient generation of appropriate test data sets for suffix array based algorithms. We also show that summing up the strings for all suffix arrays builds a particular instance for some summation identities of Eulerian numbers.
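For very small inputs, the counting question in this abstract can be explored by brute force. The Python sketch below enumerates every string of a given length over a fixed alphabet and groups the strings by suffix array. It is not the authors' counting method, the function name `count_strings_per_suffix_array` is hypothetical, and it assumes plain lexicographic suffix order with no sentinel character, which may differ from the conventions of the paper.

```python
from collections import Counter
from itertools import product


def count_strings_per_suffix_array(length, alphabet_size):
    """Map each suffix array to the number of strings that produce it.

    Strings are tuples over the alphabet {0, ..., alphabet_size - 1}.
    The enumeration visits alphabet_size ** length strings, so it is only
    practical for tiny parameters.
    """
    counts = Counter()
    for s in product(range(alphabet_size), repeat=length):
        # A proper prefix sorts before any longer tuple that extends it.
        sa = tuple(sorted(range(length), key=lambda i: s[i:]))
        counts[sa] += 1
    return counts


counts = count_strings_per_suffix_array(3, 2)
print(len(counts), "distinct suffix arrays")
for sa, number_of_strings in sorted(counts.items()):
    print(sa, number_of_strings)
```

The counts always sum to the total number of strings, `alphabet_size ** length`, which gives a quick sanity check on the enumeration.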

