Compressed Suffix Arrays for Massive Data

Sirén, Jouni

doi:10.1007/978-3-642-03784-9_7

Cited by 39 publications

(46 citation statements)

References 25 publications


“…Once the unique subset U of R has been calculated, we do not need to recompute the FM-index of U from scratch. The BWT of U can be derived from the FM-index of R by marking the positions in B_R that correspond to reads that were discarded and exporting the unmarked positions as B_U (Sirén 2009). …”

Section: Read Filtering (mentioning)

confidence: 99%

Efficient de novo assembly of large genomes using compressed data structures

Simpson, Durbin

2011

Genome Res.

693

614


De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
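The read-filtering statement quoted above (mark the positions of B_R that belong to discarded reads, export the rest as B_U) can be illustrated with a minimal sketch. It assumes a naive, uncompressed construction in which every read ends with a `$` terminator and ties between identical suffixes are broken by read index. The names `collection_bwt`, `filter_bwt`, and the explicit `owner` array are hypothetical; a real FM-index would not store an owner per position but would locate a read's positions by other means, such as walking the index from that read's terminator.

```python
def collection_bwt(reads):
    """Naive BWT of a read collection (illustration only, not compressed).

    Returns (bwt, owner), where owner[i] is the index of the read that
    contributed position i of the BWT.
    """
    rows = []
    for r, read in enumerate(reads):
        text = read + "$"  # assumption: "$" sorts before every read symbol
        for i in range(len(text)):
            # text[i - 1] is the symbol preceding the suffix (cyclic within the read)
            rows.append((text[i:], r, text[i - 1]))
    # Sort by suffix; identical suffixes are ordered by read index.
    rows.sort(key=lambda row: (row[0], row[1]))
    bwt = "".join(row[2] for row in rows)
    owner = [row[1] for row in rows]
    return bwt, owner


def filter_bwt(bwt, owner, discarded):
    """Drop the marked positions (those owned by discarded reads)."""
    return "".join(c for c, r in zip(bwt, owner) if r not in discarded)


reads = ["ACGT", "ACGT", "CGTA", "TTAC"]
discarded = {1}  # the second read duplicates the first
bwt_r, owner = collection_bwt(reads)
bwt_u = filter_bwt(bwt_r, owner, discarded)

unique = [read for i, read in enumerate(reads) if i not in discarded]
print(bwt_u == collection_bwt(unique)[0])
```

The filtered string matches a from-scratch construction because removing reads does not change the relative order of the suffixes that remain.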


“…Linear time algorithms exist for the task, but their practical bottleneck is the peak memory consumption. Although there exist general time‐efficient and space‐efficient construction algorithms, it turned out that our special case of text collection admits a tailored incremental BWT construction algorithm (see the references and experimental comparison therein for previous work on BWT construction): The text collection is split into several smaller collections, and a temporary index is built for each of them separately. The temporary indexes are then merged and finally, converted into a static FM‐index.…”

Section: Text Representation (mentioning)

confidence: 99%

Fast in‐memory XPath search using compressed indexes

et al. 2013

Self Cite


ISI publication article.

Extensible Markup Language (XML) documents consist of text data plus structured data (markup). XPath allows querying both text and structure. Evaluating such hybrid queries is challenging. We present a system for in-memory evaluation of XPath search queries, that is, queries with text and structure predicates, yet without advanced features such as backward axes, arithmetic, and joins. We show that for this query fragment, which contains Forward Core XPath, our system, dubbed Succinct XML Self-Index (‘SXSI’), outperforms existing systems by 1–3 orders of magnitude. SXSI is based on state-of-the-art indexes for text and structure data. It combines two novelties. On one hand, it represents the XML data in a compact indexed form, which allows it to handle larger collections in main memory while supporting powerful search and navigation operations over the text and the structure. On the other hand, it features an execution engine that uses tree automata and cleverly chooses evaluation orders that leverage the speeds of the respective indexes. SXSI is modular and allows seamless replacement of its indexes. This is demonstrated through experiments with (1) a text index specialized for search of bio sequences, and (2) a word-based text index specialized for natural language search.

Fondecyt, Chile 1-11006


“…Linear time algorithms exist for the task, but their practical bottleneck is the peak memory consumption. Although there exist general time-efficient and space-efficient construction algorithms, it turned out that our special case of text collection admits a tailored incremental BWT construction algorithm [40] (see the references and experimental comparison therein for previous work on BWT construction): The text collection is split into several smaller collections, and a temporary index is built for each of them separately. The temporary indexes are then merged and finally, converted into a static FM-index.…”

Section: Construction and Text Extraction (mentioning)

confidence: 99%

Fast in-memory XPath search using compressed indexes

Arroyuelo, Claude, Maneth, et al.

2010

2010 IEEE 26th International Conference on Data Engineering (ICDE 2010)


A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par with or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1–3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.
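The incremental construction described in the citation statement for this entry (split the collection, build a temporary index per part, merge) can be sketched in a deliberately naive form. This is an assumption-laden illustration, not the cited merging algorithm: each temporary "index" here is just a sorted list of explicit suffixes, whereas a space-efficient implementation would merge compressed indexes without materializing suffixes. The names `sorted_rows` and `incremental_bwt` are hypothetical.

```python
import heapq


def sorted_rows(reads, first_id):
    """Temporary 'index' for one batch: sorted (suffix, read id, preceding symbol) rows."""
    rows = []
    for offset, read in enumerate(reads):
        text = read + "$"  # assumption: "$" sorts before every text symbol
        for i in range(len(text)):
            rows.append((text[i:], first_id + offset, text[i - 1]))
    rows.sort()  # identical suffixes are ordered by global read id
    return rows


def incremental_bwt(reads, batch_size):
    """Split the collection into batches, index each batch, then merge."""
    batches = [
        sorted_rows(reads[start:start + batch_size], start)
        for start in range(0, len(reads), batch_size)
    ]
    # heapq.merge combines the already-sorted batches in one pass.
    merged = heapq.merge(*batches)
    return "".join(row[2] for row in merged)


texts = ["GATTACA", "TACAGA", "ACAT", "GATT", "CATTAG"]
in_batches = incremental_bwt(texts, batch_size=2)
all_at_once = incremental_bwt(texts, batch_size=len(texts))
print(in_batches == all_at_once)
```

Only the per-batch sort touches unsorted data, which is the motivation for splitting: the peak working set of the construction step is bounded by the batch size rather than by the whole collection.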
