Hierarchical Interleaved Bloom Filter: Enabling ultrafast, approximate sequence queries

Mehringer, Svenja; Seiler, Enrico; Droop, Felix; Darvish, Mitra; Rahn, René; Vingron, Martin; Reinert, Knut

doi:10.1101/2022.08.01.502266

Cited by 5 publications (11 citation statements)

References 24 publications

“…We evaluated the performances of kmindex together with eight state-of-the-art k-mer indexers: themisto [2]; ggcat [7]; HIBF [18]; PAC [17]; MetaProFi [22]; MetaGraph [13]; Bifrost [12]; and COBS [3]. The dataset for this benchmark is composed of metagenomic seawater sequencing data from 50 Tara Oceans samples, of 1.4TB of gzipped fastq files.…”

Section: Comparative Results Indexing 50 Metagenomic Seawater Samples (mentioning)

confidence: 99%

kmindex and ORA: indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets

Lemane, Lezzoche, Lecubin et al. 2023

Preprint

Despite their wealth of biological information, public sequencing databases are largely underutilized. One cannot efficiently search for a sequence of interest in these immense resources. Sophisticated computational methods such as approximate membership query data structures allow searching for fixed-length words (k-mers) in large datasets. Yet they face scalability challenges when applied to thousands of complex sequencing experiments. In this context we propose kmindex, a new approach that uses inverted indexes based on Bloom filters. Thanks to its algorithmic choices and its fine-tuned implementation, kmindex offers the possibility to index thousands of highly complex metagenomes into an index that answers sequence queries in a tenth of a second. Index construction is one order of magnitude faster than previous approaches, and query time is two orders of magnitude faster. Based on Bloom filters, kmindex achieves negligible false positive rates, below 0.01% on average. Its average false positive rate is four orders of magnitude lower than existing approaches, for similar index sizes. It has been successfully used to index 1,393 complex marine seawater metagenome samples of raw sequences from the Tara Oceans project, demonstrating its effectiveness on large and complex datasets. This level of scaling was previously unattainable. Building on the kmindex results, we provide a public web server named "Ocean Read Atlas" (ORA) at https://ocean-read-atlas.mio.osupytheas.fr/ that can answer queries against the entire Tara Oceans dataset in real time. kmindex is open-source software available at https://github.com/tlemane/kmindex.

“…However, storing k-mer counts and variant graphs in memory may lead to significant memory consumption, particularly when dealing with large amounts of sequencing data or reference genomes. Alternatively, implementing a Counting Bloom filter or Hierarchical Interleaved Bloom Filter (HIBF) instead of storing k-mer counts in memory may help reduce memory consumption while maintaining the desired level of accuracy [46,47]. PanGenie, another software program employing k-mer counts, exhibits significant memory consumption despite only considering unique k-mers [19].…”

Section: Discussion (mentioning)

confidence: 99%

A comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes for creating an accurate ensemble pipeline

Jiao 2023

Preprint

Background: Although sequencing technologies have boosted the measurement of the sequencing diversity of plant crops, it remains challenging to accurately genotype millions of genetic variants, especially structural variations, with only short reads. In recent years, many graph-based variation genotyping methods have been developed to address this issue and tested on human genomes; however, their performance in plant genomes remains largely elusive. Furthermore, pipelines integrating the advantages of current genotyping methods might be required, considering the different complexity of plant genomes.

Results: Here we comprehensively evaluate eight such genotypers in different scenarios in terms of variant type and size, sequencing parameters, genomic context, and complexity, as well as graph size, using both simulated and read data sets from representative plant genomes. Our evaluation reveals that there are still great challenges to applying existing methods to plants, such as excessive repeats and variants or high resource consumption. Therefore, we propose a pipeline called Ensemble Variant Genotyper (EVG) that can achieve better genotype concordances without increasing resource consumption. EVG can achieve comparably higher genotyping recall and precision even using 5X reads. Furthermore, we demonstrate that EVG is more robust with an increasing number of variants, especially for insertions and deletions.

Conclusions: Our study will provide new insights into the development and application of graph-based genotyping algorithms. We conclude that EVG provides an accurate, unbiased, and cost-effective way of genotyping both small and large variations and could potentially be used in population-scale genotyping for large, repetitive, and heterozygous plant genomes.
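For readers unfamiliar with the Counting Bloom filter named in the citation statement above, the following is a minimal Python sketch of the general idea: each item increments a few hashed counters, and the estimated count is the minimum over those counters. The class name, parameters, and the use of salted BLAKE2b hashing are illustrative assumptions; this is not the method of the citing paper or of HIBF.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting Bloom filter (hypothetical names and parameters)."""

    def __init__(self, size, num_hashes):
        self.size = size                # number of counters
        self.num_hashes = num_hashes    # hash functions per item
        self.counters = [0] * size

    def _positions(self, item):
        # Assumption: derive independent hash functions by salting BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, item):
        # Increment every counter the item hashes to.
        for pos in self._positions(item):
            self.counters[pos] += 1

    def count(self, item):
        # The minimum counter never underestimates the true count,
        # but collisions with other items can inflate it.
        return min(self.counters[pos] for pos in self._positions(item))


cbf = CountingBloomFilter(size=1024, num_hashes=3)
for _ in range(3):
    cbf.add("ACGTACGT")
estimate = cbf.count("ACGTACGT")
```

The memory footprint is fixed by `size` regardless of how many distinct k-mers are inserted, which is the trade-off the statement alludes to: lower memory in exchange for counts that may be overestimated.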

show abstract

“…A gentle compression with (24, 20)-minimizers already reduces the index size to a third of the 20-mer HIBF. A compression with (40, 32)-minimizers already reduces the size by a factor of 5 compared to a 32-mer HIBF. A small index using minimizers speeds up the query time (e.g., by a factor of 4 for (40, 32)-minimizers) compared to the uncompressed HIBF.…”

Section: Flexible Compression Using Minimizers (mentioning)

confidence: 99%
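To make the minimizer compression in the statement above concrete, here is a small Python sketch. It assumes the common reading of (w, k)-minimizers: w is the window length in bases, k is the k-mer length, and each window contributes only its smallest k-mer. The function name and the lexicographic ordering are illustrative assumptions; production tools typically order k-mers by a hash value and use canonical k-mers.

```python
def minimizers(seq, w, k):
    """Return the set of (w, k)-minimizers of seq.

    Assumption: w is the window length in bases, k the k-mer length (k <= w),
    and the minimizer of a window is its lexicographically smallest k-mer.
    """
    if not (0 < k <= w):
        raise ValueError("need 0 < k <= w")
    if len(seq) < w:
        return set()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    per_window = w - k + 1  # number of k-mers inside one window
    selected = set()
    for start in range(len(kmers) - per_window + 1):
        selected.add(min(kmers[start:start + per_window]))
    return selected


seq = "ACGTACGTAC"
all_kmers = minimizers(seq, 4, 4)   # w == k keeps every distinct k-mer
compressed = minimizers(seq, 6, 4)  # larger windows keep fewer k-mers
```

In this toy input the four distinct 4-mers shrink to two minimizers. Storing fewer k-mers per sequence is what lets an index built on minimizers be smaller than one built on all k-mers.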