2022
DOI: 10.1101/2022.08.01.502266
Preprint

Hierarchical Interleaved Bloom Filter: Enabling ultrafast, approximate sequence queries

Abstract: Searching sequences in large, distributed databases is the most widely used bioinformatics analysis done. This basic task is in dire need for solutions that deal with the exponential growth of sequence repositories and perform approximate queries very fast. In this paper, we present a novel data structure: the Hierarchical Interleaved Bloom Filter (HIBF). It is extremely fast and space efficient, yet so general that it has the potential to serve as the underlying engine for many applications. We show that the …

Cited by 5 publications (11 citation statements)
References 24 publications
“…We evaluated the performances of kmindex together with eight state-of-the-art k-mer indexers: themisto [2]; ggcat [7]; HIBF [18]; PAC [17]; MetaProFi [22]; MetaGraph [13]; Bifrost [12]; and COBS [3]. The dataset for this benchmark is composed of metagenomic seawater sequencing data from 50 Tara Oceans samples, of 1.4TB of gzipped fastq files.…”
Section: Comparative Results Indexing 50 Metagenomic Seawater Samples | mentioning
confidence: 99%
“…However, storing k-mer counts and variant graphs in memory may lead to significant memory consumption, particularly when dealing with large amounts of sequencing data or reference genomes. Alternatively, implementing a Counting Bloom filter or Hierarchical Interleaved Bloom Filter (HIBF) instead of storing k-mer counts in memory may help reduce memory consumption while maintaining the desired level of accuracy [46,47]. PanGenie, another software program employing k-mer counts, exhibits significant memory consumption despite only considering unique k-mers [19].…”
Section: Discussion | mentioning
confidence: 99%
“…A gentle compression with (24, 20)-minimizers already reduces the index size to a third of the 20-mer HIBF. A compression with (40,32)-minimizer already reduces the size by a factor of 5 compared to a 32-mer HIBF. A small index using minimizers speeds up the query time (e.g., by a factor of 4 for (40, 32) minimizers) compared to the uncompressed HIBF.…”
Section: Flexible Compression Using Minimizers | mentioning
confidence: 99%
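
The last statement above attributes the smaller index to (w, k)-minimizers. As a rough illustration of why that shrinks the set of stored k-mers, here is a minimal Python sketch. It assumes that a (w, k)-minimizer is the k-mer with the smallest hash value inside each window of w bases. The function names, the choice of BLAKE2b as the hash, and the omission of reverse complements are all assumptions made for illustration; this is not the HIBF implementation.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    # Hypothetical ordering function: any deterministic hash would do here.
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def all_kmers(seq: str, k: int) -> set[str]:
    # Every distinct k-mer of the sequence (what an uncompressed index stores).
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minimizers(seq: str, w: int, k: int) -> set[str]:
    # For each window of w bases, keep only the k-mer with the smallest hash.
    # Assumption: reverse complements are ignored to keep the sketch short.
    if not 0 < k <= w:
        raise ValueError("require 0 < k <= w")
    selected = set()
    for start in range(len(seq) - w + 1):
        window = seq[start:start + w]
        candidates = [window[i:i + k] for i in range(w - k + 1)]
        selected.add(min(candidates, key=kmer_hash))
    return selected


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGGATCCA"
    print(len(all_kmers(seq, 4)), "distinct 4-mers")
    print(len(minimizers(seq, 8, 4)), "distinct (8, 4)-minimizers")
```

Because neighbouring windows often share the same smallest k-mer, the selected set is never larger than the full k-mer set, so fewer elements go into the filter. With w equal to k, each window holds exactly one k-mer and no compression happens.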