Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

Seiler, Enrico; Mehringer, Svenja; Darvish, Mitra; Turc, Etienne; Reinert, Knut

doi:10.1016/j.isci.2021.102782

Cited by 8 publications

(43 citation statements)

References 20 publications


“…The runtimes were initially significantly improved by the Patro group with the tool Mantis in [18] and by the Iqbal group with COBS [3]. This year, the Reinert lab introduced the IBF [23], which has proven to be a significant step towards a very time and space efficient in-memory data structure for preprocessing approximate sequence queries, which opens up many possible applications. It improves in runtime by a factor of 12-144 over its competitors COBS and Mantis.…”

Section: Related Work (mentioning)

confidence: 99%

“…as a bit mask. Like Seiler et al [23], it uses minimizers to reduce the number of 𝑘-mers to be queried and thus the number of costly memory accesses. We decided on the Intel FPGA SDK for OpenCL version 2021.3 as the implementation environment, as it offers a high-level programming model with an acceptable overhead and encapsulates the entire host interaction in a well-known API, which allowed us to focus on the algorithmic optimizations of the problem.…”

Section: Count(p) (mentioning)

confidence: 99%
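The minimizer idea mentioned in the statement above can be illustrated with a minimal sketch. It assumes lexicographic ordering of k-mers and a window measured as `w` consecutive k-mers; the function name `minimizers` is illustrative only and is not taken from Raptor or the citing paper, which may order and window k-mers differently.

```python
def minimizers(seq, k, w):
    """Return (position, k-mer) pairs chosen as window minimizers.

    Assumptions for this sketch: k-mers are compared lexicographically,
    a window spans w consecutive k-mers, and ties go to the leftmost k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min over offsets returns the leftmost smallest k-mer in the window
        offset = min(range(w), key=lambda j: window[j])
        pos = start + offset
        # neighbouring windows often share a minimizer; record it only once
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


if __name__ == "__main__":
    for pos, kmer in minimizers("ACGTACGT", k=3, w=2):
        print(pos, kmer)
```

Because adjacent windows frequently select the same k-mer, the list of minimizers is usually shorter than the list of all k-mers, which is what reduces the number of lookups per query.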

“…One of the main tasks of such pipelines is to (approximately) search large reference data sets for sequencing reads or short sequence patterns like genes. Hence, researchers had to develop novel indexing data structures such as the Interleaved Bloom Filter (IBF) [7] and, based on this, an extension with winnowing minimizers and probabilistic thresholding called Raptor [23] which is currently the state-of-the-art for distributing approximate queries with an in-memory data structure. The CPU-based IBF implementation can distribute 10 million NGS queries for combined texts of hundreds of Gigabytes in only a few seconds.…”

Section: Introduction (mentioning)

confidence: 99%

“…The following section contains a brief introduction to the IBF data structure, details can be found in [23].…”

Section: Introduction (mentioning)

confidence: 99%

“…For the approximate search of a query 𝑃, the binning bitvectors of all representative 𝑘-mers in the query are combined into a counting vector and the membership of a query in a bin is determined by applying an appropriate threshold (see [23]). This approach is depicted in Figure 2.…”

Section: Introduction (mentioning)

confidence: 99%
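The query procedure described in the statement above (per-k-mer binning bitvectors summed into a counting vector, then thresholded) can be sketched as a toy example. Everything named here (`ToyIBF`, `count_query`, `kmer_lemma_threshold`, `search`, the BLAKE2b-based hash family) is a hypothetical illustration, not the SeqAn or Raptor API. The sketch also simplifies in two ways: it uses all k-mers rather than representative ones, and it uses the classical k-mer lemma as a threshold in place of the probabilistic threshold of [23].

```python
import hashlib


class ToyIBF:
    """Toy interleaved Bloom filter: one Bloom filter per bin, stored row-wise."""

    def __init__(self, bins, rows, hashes=2):
        self.bins = bins
        self.hashes = hashes
        # table[r] is an int used as a bitvector: bit b belongs to bin b
        self.table = [0] * rows

    def _rows(self, kmer):
        # Assumed hash family for this sketch: BLAKE2b with a per-function salt
        for seed in range(self.hashes):
            digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                                     salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % len(self.table)

    def insert(self, kmer, bin_id):
        for r in self._rows(kmer):
            self.table[r] |= 1 << bin_id

    def binning_bitvector(self, kmer):
        # AND the rows of all hash functions: bit b stays set only if
        # bin b may contain the k-mer
        mask = (1 << self.bins) - 1
        for r in self._rows(kmer):
            mask &= self.table[r]
        return mask


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_query(ibf, query, k):
    """Sum the binning bitvectors of all query k-mers into a counting vector."""
    counts = [0] * ibf.bins
    for kmer in kmers(query, k):
        mask = ibf.binning_bitvector(kmer)
        for b in range(ibf.bins):
            counts[b] += (mask >> b) & 1
    return counts


def kmer_lemma_threshold(query_len, k, errors):
    # k-mer lemma: each error destroys at most k of the query's k-mers
    return max(1, query_len - k + 1 - errors * k)


def search(ibf, query, k, errors=0):
    """Return the bins whose k-mer count reaches the threshold."""
    threshold = kmer_lemma_threshold(len(query), k, errors)
    counts = count_query(ibf, query, k)
    return [b for b, c in enumerate(counts) if c >= threshold]
```

Storing one bit per bin in each row is what lets a single lookup answer the membership question for all bins at once. As with any Bloom filter, a reported bin can be a false positive, but a bin that truly contains every query k-mer is never missed.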


Co-Design for Energy Efficient and Fast Genomic Search

Knaust, Seiler, Reinert, et al. 2022

Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Self Cite

Next-Generation Sequencing technologies generate a vast and exponentially increasing amount of sequence data. The Interleaved Bloom Filter (IBF) is a novel indexing data structure which is state-of-the-art for distributing approximate queries with an in-memory data structure. With it, a main task of sequence analysis pipelines, (approximately) searching large reference data sets for sequencing reads or short sequence patterns like genes, can be significantly accelerated. To meet performance and energy-efficiency requirements, we chose a co-design approach of the IBF data structure on the FPGA platform. Further, our OpenCL-based implementation allows a seamless integration into the widely used SeqAn C++ library for biological sequence analysis. Our algorithmic design and optimization strategy takes advantage of FPGA-specific features like shift register and the parallelization potential of many bitwise operations. We designed a well-chosen schema to partition data across the different memory domains on the FPGA platform using the Shared Virtual Memory concept. We can demonstrate significant improvements in energy efficiency of up to 19× and in performance of up to 5.6×, respectively, compared to a well-tuned, multithreaded CPU reference.

CCS Concepts: Computer systems organization → Reconfigurable computing; Applied computing → Bioinformatics; Computational genomics.


Creating and Using Minimizer Sketches in Computational Genomics

Zheng, Marçais, Kingsford 2023

Journal of Computational Biology

Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments

Darvish, Seiler, Mehringer, et al. 2022

Bioinformatics

Self Cite

Motivation: The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data.

Results: As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in less than two hours and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query.

Supplementary information: Supplementary data are available at Bioinformatics online.

Availability and implementation: https://github.com/seqan/needle

Abstract (Highlights): Raptor is a tool to search through large collections of genomic texts. Raptor is 12-144 times faster and uses up to 30 times less RAM than COBS or Mantis. The Raptor index is 6-50 times faster to build. The use of minimizers and Bloom filters makes Raptor very space-efficient.
