HighlightsRaptor is a tool to search through large collections of genomic texts Raptor is 12-144 times faster and uses up to 30 times less RAM than COBS or MantisThe Raptor index is 6-50 times faster to build The use of minimizers and Bloom filters makes Raptor very spaceefficient
Motivation
The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data.
Results
As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in less than two hours and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query.
Supplementary information
Supplementary data are available at Bioinformatics online.
Availability and implementation
https://github.com/seqan/needle
We present a novel data structure for searching sequences in large databases: the Hierarchical Interleaved Bloom Filter (HIBF). It is extremely fast and space efficient, yet so general that it could serve as the underlying engine for many applications. We show that the HIBF is superior in build time, index size, and search time while achieving a comparable or better accuracy compared to other state-of-the-art tools. The HIBF builds an index up to 211 times faster, using up to 14 times less space, and can answer approximate membership queries faster by a factor of up to 129.
Searching sequences in large, distributed databases is the most widely used bioinformatics analysis done. This basic task is in dire need for solutions that deal with the exponential growth of sequence repositories and perform approximate queries very fast. In this paper, we present a novel data structure: the Hierarchical Interleaved Bloom Filter (HIBF). It is extremely fast and space efficient, yet so general that it has the potential to serve as the underlying engine for many applications. We show that the HIBF is superior in build time, index size and search time while achieving a comparable or better accuracy compared to other state-of-the art tools (Mantis and Bifrost). The HIBF builds an index up to 211 times faster, using up to 14 times less space and can answer approximate membership queries faster by a factor of up to 129. This can be considered a quantum leap that opens the door to indexing complete sequence archives like the European Nucleotide Archive or even larger metagenomics data sets.
We present Raptor, a tool for approximately searching many queries in large collections of nucleotide sequences. In comparison with similar tools like Mantis and COBS, Raptor is 12-144 times faster and uses up to 30 times less memory. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the Interleaved Bloom Filters (IBF) as a set membership data structure, and probabilistic thresholding for minimizers. Our approach allows compression and a partitioning of the IBF to enable the effective use of secondary memory.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.