FEMTO: Fast Search of Large Sequence Collections

Ferguson, Michael P.

doi:10.1007/978-3-642-31265-6_17

Cited by 5 publications

(4 citation statements)

References 16 publications


“…This paper has initiated the study of a new class of pattern matching problems over “Stringomes,” and proposed several algorithmic solutions, which are instantiations of a basic data-structural scheme. The resulting algorithms are shown to be rather simple and yet efficient in space and time, so they are amenable to be implemented by using known geometric and string-matching libraries (such as LEDA and PizzaChili, just to name a few) or as an extension of the FEMTO software package [8]. The solutions proposed here have immediate applications to next-generation sequencing technologies, base-calling, variant-calling, expression analysis, population studies and onco-genomics.…”

Section: Discussion (mentioning)

confidence: 99%

Algorithms in Stringomics (I): Pattern-Matching against “Stringomes”

Ferragina

Mishra

2014

Preprint


This paper reports an initial design of new data-structures that generalizes the idea of pattern-matching in stringology, from its traditional usage in an (unstructured) set of strings to the arena of a well-structured family of strings. In particular, the object of interest is a family of strings composed of blocks/classes of highly similar “stringlets,” which thus mimics a population of genomes made by concatenating haplotype-blocks, further constrained by haplotype-phasing. Such a family of strings, which we dub “stringomes,” is formalized in terms of a multi-partite directed acyclic graph with a source and a sink. The most interesting property of stringomes is probably the fact that they can be represented efficiently with compression up to their k-th order empirical entropy, while ensuring that the compression does not hinder the pattern-matching counting and reporting queries – either internal to a block or spanning two (or a few constant) adjacent blocks. The solutions proposed here have immediate applications to next-generation sequencing technologies, base-calling, expression profiling, variant-calling, population studies, onco-genomics, cyber security trace analysis and text retrieval.


“…Ferguson shows that it is even possible to execute regular expressions on data stored using an FM-index [8]. In his paper, he describes a system called FEMTO, which can index large datasets while still maintaining adequate performance.…”

Section: Burrows-Wheeler Transform and FM-index (mentioning)

confidence: 99%

Substring Filtering for Low-Cost Linked Data Interfaces

Herwegen

Vocht

Verborgh

et al. 2015

The Semantic Web - ISWC 2015


Abstract. Recently, Triple Pattern Fragments (TPFs) were introduced as a low-cost server-side interface when high numbers of clients need to evaluate SPARQL queries. Scalability is achieved by moving part of the query execution to the client, at the cost of elevated query times. Since the TPF interface purposely does not support complex constructs such as SPARQL filters, queries that use them need to be executed mostly on the client, resulting in long execution times. We therefore investigated the impact of adding a literal substring matching feature to the TPF interface, with the goal of improving query performance while maintaining low server cost. In this paper, we discuss the client/server setup and compare the performance of SPARQL queries on multiple implementations, including Elastic Search and case-insensitive FM-index. Our evaluations indicate that these improvements allow for faster query execution without significantly increasing the load on the server. Offering the substring feature on TPF servers allows users to obtain faster responses for filter-based SPARQL queries. Furthermore, substring matching can be used to support other filters such as complete regular expressions or range queries.


“…Other recent work is by Ferguson [26], who describes a search structure called FEMTO, and provides experiments on 43 GB of English text (Project Gutenberg files), and on 182 GB of genomic data. The FEMTO system uses a partitioned FM-INDEX, with the search for each pattern proceeding through (at least) one disk block per symbol.…”

Section: B Other Recent Work (mentioning)

confidence: 99%
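
The per-symbol access pattern described in this statement is characteristic of FM-index backward search in general, which consumes the pattern one symbol at a time from right to left. The following is a minimal in-memory sketch of that textbook procedure, not FEMTO's implementation: the function names are hypothetical, the suffix sorting is naive, and nothing here is partitioned or stored on disk. In an on-disk index, each loop iteration in `backward_search` is where the rank lookups would touch a block.

```python
def build_fm_index(text):
    """Build a toy FM-index for `text` (illustrative; quadratic-time construction)."""
    assert "\0" not in text, "the NUL character is reserved as the sentinel"
    s = text + "\0"  # sentinel sorts before every other symbol
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive suffix array
    bwt = "".join(s[i - 1] for i in sa)               # Burrows-Wheeler transform

    # C[c] = number of symbols in s that are strictly smaller than c.
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed rank table).
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted start positions of `pattern`, one step per symbol."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        # Two rank lookups per symbol narrow the range of matching rows.
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, C, occ = build_fm_index("abracadabra")
    print(backward_search("abra", sa, C, occ))
```

The full rank table used here costs space proportional to the alphabet size times the text length; practical FM-indexes replace it with sampled or compressed rank structures.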

Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays

Gog

Moffat

Culpepper

et al. 2014

IEEE Trans. Knowl. Data Eng.


Abstract: The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix pointers. In this paper we describe a new two-level suffix array-based index structure that requires significantly less disk space than previous approaches. Key to the saving is the use of disk blocks that are based on prefixes rather than the more usual uniform-sampling approach, allowing reductions between blocks and subparts of other blocks. We also describe a new in-memory structure (the condensed BWT) and show that it allows common patterns to be resolved without access to the text. Experiments using 64 GB of English web text on a computer with 4 GB of main memory demonstrate the speed and versatility of the new approach. For this data the index is around one-third the size of previous two-level mechanisms; and the memory footprint of as little as 1% of the text size means that queries can be processed more quickly than is possible with a compact FM-INDEX.
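
For readers unfamiliar with the baseline this abstract starts from, in-memory pattern search over a suffix array can be sketched as two binary searches. This is only the plain textbook case under simplifying assumptions (naive construction, hypothetical function names); it does not model the paper's two-level on-disk structure, its prefix-based blocks, or the condensed BWT.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in sorted order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of `pattern` in `text`."""
    m = len(pattern)

    # Lower bound: first row whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first row whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All rows in [start, lo) begin with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "abra"))
```

Each comparison reads from the text itself, which is why an external-memory variant needs some in-memory structure to limit how often the text or the suffix pointers on disk are consulted.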
