Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT

Cracco, Andrea; Tomescu, Alexandru I.

doi:10.1101/gr.277615.122

Cited by 19 publications

(17 citation statements)

References 45 publications

“…We evaluated the performances of kmindex together with eight state-of-the-art k-mer indexers: themisto [2]; ggcat [7]; HIBF [18]; PAC [17]; MetaProFi [22]; MetaGraph [13]; Bifrost [12]; and COBS [3]. The dataset for this benchmark is composed of metagenomic seawater sequencing data from 50 Tara Oceans samples, of 1.4TB of gzipped fastq files.…”

Section: Comparative Results Indexing 50 Metagenomic Seawater Samples (mentioning)

confidence: 99%

kmindex and ORA: indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets

Lemane, Lezzoche, Lecubin et al. 2023

Preprint

Despite their wealth of biological information, public sequencing databases are largely underutilized. One cannot efficiently search for a sequence of interest in these immense resources. Sophisticated computational methods such as approximate membership query data structures allow searching for fixed-length words (k-mers) in large datasets. Yet they face scalability challenges when applied to thousands of complex sequencing experiments. In this context we propose kmindex, a new approach that uses inverted indexes based on Bloom filters. Thanks to its algorithmic choices and its fine-tuned implementation, kmindex offers the possibility to index thousands of highly complex metagenomes into an index that answers sequences queries in the tenth of a second. Index construction is one order of magnitude faster than previous approaches, and query time is two orders of magnitude faster. Based on Bloom filters, kmindex achieves negligible false positive rates, below 0.01% on average. Its average false positive rate is four orders of magnitude lower than existing approaches, for similar index sizes. It has been successfully used to index 1,393 complex marine seawater metagenome samples of raw sequences from the Tara Oceans project, demonstrating its effectiveness on large and complex datasets. This level of scaling was previously unattainable. Building on the kmindex results, we provide a public web server named "Ocean Read Atlas" (ORA) at https://ocean-read-atlas.mio.osupytheas.fr/ that can answer queries against the entire Tara Oceans dataset in real-time. kmindex is open-source software available at https://github.com/tlemane/kmindex.
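The abstract above describes indexing k-mers with Bloom filters, one of the approximate membership query structures it mentions. A minimal sketch of that general idea follows. It is not kmindex's implementation: the class, the function names, the filter size, and the hashing scheme are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; sizes and hashing are illustrative assumptions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(samples, k, num_bits=1 << 16, num_hashes=3):
    """Build one Bloom filter per sample (hypothetical layout)."""
    index = {}
    for name, seq in samples.items():
        bf = BloomFilter(num_bits, num_hashes)
        for km in kmers(seq, k):
            bf.add(km)
        index[name] = bf
    return index


def query(index, seq, k):
    """Return, per sample, the fraction of query k-mers reported present."""
    query_kmers = list(kmers(seq, k))
    return {
        name: sum(km in bf for km in query_kmers) / len(query_kmers)
        for name, bf in index.items()
    }
```

A query whose k-mers all occur in a sample always scores 1.0 for that sample, because a Bloom filter has no false negatives. Scores for unrelated samples can be inflated by false positives, which is the error rate the abstract reports.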

“…The task of counting maximal unitigs for all k values takes O(|V_U|) time when using a Prokrustean graph (section 2.2). We compare with GGCAT [8], an efficient compacted de Bruijn graph generating algorithm. 3: Counting maximal unitigs with a range of k. The Prokrustean approach is orders of magnitude faster than GGCAT, which is not designed for this task and was called each k separately to compute de Bruijn graphs.…”

Section: Application: Counting Maximal Unitigs Of De Bruijn Graphs Fo... (mentioning)

confidence: 99%

“…Fig. 1: The performance of Prokrustean graphs for two representative functionalities were compared with state-of-the-arts: KMC [15] is a k-mer counting tool and GGCAT [8] constructs compacted de Bruijn graphs. Both were iterative called to extract k-mer/unitig counts of k = 30.…”

Section: Introduction (mentioning)

confidence: 99%

Prokrustean Graph: A substring index for rapid k-mer size analysis

Park, Koslicki 2023

Preprint

Despite the widespread adoption of k-mer-based methods in bioinformatics, a fundamental question persists: How to elucidate the structural transition of a k-mer set when the order switches to k′? Attaining a generalized answer has significant implications to k-mer-based methods where the influence of k have been empirically analyzed (eg. in areas of assembly, genome comparison, etc.). We unravel the problem with a principle: k-mers and k′-mers can be grouped by their co-occurrences, and those in the same group behave similarly regardless of applications. This concept is embodied in a model, the Prokrustean graph, which embraces all similarity information of a given sequence set and has a channel to access k-mers of any k. This gives us a theoretical framework in which we can understand the presence and frequency of k-mers as k changes. Practically, a Prokrustean graph is a space efficient data structure that can quickly be queried to extract essentially arbitrary information regarding k-mers with time complexity independent of k-mer size range. We provide a series of examples that perform in competitive time and space when compared to purpose-built tools that operate on a single k, such as KMC and GGCAT. For example, with large read sets, we can count all k-mers for k = 30 … 150 in ≃ 30 seconds in comparison to KMC requiring ≃ 30 seconds for a single k. Similarly, on long read sets, we can count all unitigs in ≃ 30 seconds for the entire range k = 30 … 50000, a task that is prohibitively burdensome for GGCAT. Our construction algorithm of Prokrustean graph utilizes an (extended) Burrows-Wheeler transform as input, is easily parallelizable, and operates in O(N) time where N is the cumulative input sequence length. We provide theoretical justification that the size of a Prokrustean graph is O(N), and in practice, is significantly smaller than N. All code and algorithms are publicly accessible at: https://github.com/KoslickiLab/prokrustean.
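To make the task in the abstract above concrete, the following is a brute-force sketch of counting distinct k-mers for every k in a range. It is not the Prokrustean graph algorithm, nor how KMC works internally. The function name and the approach are assumptions. It illustrates only the per-k repetition that a single-k tool would need, and its running time grows with the size of the range.

```python
def distinct_kmer_counts(reads, k_min, k_max):
    """Count distinct k-mers in reads for each k in [k_min, k_max].

    Naive baseline: the read set is rescanned once per value of k.
    """
    counts = {}
    for k in range(k_min, k_max + 1):
        seen = set()
        for read in reads:
            # Reads shorter than k contribute no k-mers.
            for i in range(len(read) - k + 1):
                seen.add(read[i:i + k])
        counts[k] = len(seen)
    return counts
```

Calling `distinct_kmer_counts(reads, 30, 150)` would scan the input 121 times, once per k. The abstract reports that the Prokrustean graph answers the same question with time complexity independent of the k-mer size range.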

“…It included assemblies of over 300,000 genomes which had not previously been available (the raw data only had been available). The assemblies and search indexes allowed multiple other studies of plasmids [5,6], bacterial adaptation [7,8,9,10], and compression/indexing algorithms [11,12,13,14,15]. However, there were a few limitations.…”

Section: Introduction (mentioning)

confidence: 99%

AllTheBacteria - all bacterial genomes assembled, available and searchable

Hunt, Lima, Anderson et al. 2024

Preprint

The bacterial sequence data publicly available at the global DNA archives is a vast source of information on the evolution of bacteria and their mobile elements. However, most of it is either unassembled or inconsistently assembled and QC-ed. This makes it unsuitable for large-scale analyses, and inaccessible for most researchers to use. In 2021 Blackwell et al therefore released a uniformly assembled set of 661,405 genomes, consisting of all publicly available whole genome sequenced bacterial isolate data as of November 2018, along with various search indexes. In this study we extend that dataset by 4.5 years (up to May 2023), tripling the number of genomes. We also expand the scope, as we begin a global collaborative project to generate annotations for different species as desired by different research communities. In this study we describe the initial v0.1 data release of 1,932,812 assemblies (combining 1,271,428 new assemblies with the 661k dataset). All 1.9 million have been uniformly re-processed for quality criteria and to give taxonomic abundance estimates with respect to the GTDB phylogeny. Using an evolution-informed compression approach, the full set of genomes is just 102Gb in batched xz archives. We also provide multiple search indexes. Finally, we outline plans for future annotations to be provided in further releases.
