Extremely-fast construction and querying of compacted and colored de Bruijn graphs with GGCAT

Cracco, Andrea; Tomescu, Alexandru I.

doi:10.1101/2022.10.24.513174

Cited by 15 publications

(29 citation statements)

References 49 publications


“…Kmer-based methods have found wide-spread use in many areas of bioinformatics over the past years. However, they usually rely on unitigs to represent the kmer sets, since they can be computed efficiently with standard tools [33, 23, 40, 41]. Unitigs have the additional property that the de Bruijn graph topology can easily be reconstructed from them, since they do not contain branching nodes other than on their first and last kmer.…”

Section: Discussion (mentioning)

confidence: 99%

“…In that work, the size of the SPSS is very minor compared to the size of the index, however, major components of the index may be smaller if the SPSS contains less strings, which can be achieved by using greedy matchtigs. Our algorithms were also integrated into the external-memory de Bruijn graph compactor GGCAT [41], which was easy to do [5] . [4] While this paper was under review, Schmidt and Alanko realised that the algorithm to compute matchtigs can also be used to compute optimal simplitigs, by leaving out all parts related to repeating kmers.…”

Section: Introduction (mentioning)

confidence: 99%

“…Since unitigs contain no branches in their inner nodes, they do not alter the topology of the graph, and in turn enable the exact same set of analyses. There are highly engineered solutions available to compute a compacted de Bruijn graph by computing unitigs from any set of strings in memory [23] or with external memory [33, 40, 41]. Incidentally, the set of unitigs computed from a set of strings is also a way to store a set of kmers without repetition, and thus in reasonably small space.…”

Section: Introduction (mentioning)

confidence: 99%
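The unitig property described in the statement above can be illustrated with a minimal sketch. It is not the method used by GGCAT or by any of the cited tools: it assumes a small in-memory kmer set, a node-centric de Bruijn graph, and no reverse complements, and the names `kmers_of` and `unitigs` are hypothetical helpers introduced only for this example.

```python
def kmers_of(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unitigs(kmers):
    """Compute the maximal non-branching paths (unitigs) of a kmer set.

    Simplifying assumptions: node-centric graph, forward strand only.
    """
    kmers = set(kmers)
    alphabet = "ACGT"

    def succ(x):
        # Kmers that overlap x by k-1 characters on the right.
        return [x[1:] + c for c in alphabet if x[1:] + c in kmers]

    def pred(x):
        # Kmers that overlap x by k-1 characters on the left.
        return [c + x[:-1] for c in alphabet if c + x[:-1] in kmers]

    def is_start(x):
        # x begins a unitig unless it has exactly one predecessor
        # and that predecessor has exactly one successor.
        p = pred(x)
        return len(p) != 1 or len(succ(p[0])) != 1

    def walk(x, seen):
        # Extend to the right while the path stays non-branching.
        path = x
        seen.add(x)
        while True:
            s = succ(x)
            if len(s) != 1 or len(pred(s[0])) != 1 or s[0] in seen:
                break
            x = s[0]
            seen.add(x)
            path += x[-1]
        return path

    seen = set()
    out = [walk(x, seen) for x in sorted(kmers) if is_start(x)]
    # Kmers not reached so far lie on isolated cycles; break each cycle once.
    for x in sorted(kmers):
        if x not in seen:
            out.append(walk(x, seen))
    return sorted(out)


kmer_set = kmers_of("TACGA", 3) | kmers_of("ACGT", 3)
print(unitigs(kmer_set))  # ['CGA', 'CGT', 'TACG']
```

In this toy input the kmer ACG has two successors (CGA and CGT), so the branch appears only at the last kmer of the unitig TACG, and every kmer of the input occurs exactly once across the output strings.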


Matchtigs: minimum plain text representation of kmer sets

Schmidt, Khan, Alanko, et al. 2021. Preprint. Self cite.

Kmer-based methods are widely used in bioinformatics, which raises the question of what is the smallest practically usable representation (i.e. plain text) of a set of kmers. We propose a polynomial algorithm computing a minimum such representation (which was previously posed as a potentially NP-hard open problem), as well as an efficient near-minimum greedy heuristic. When compressing genomes of large model organisms, read sets thereof or bacterial pangenomes, with only a minor runtime increase, we decrease the size of the representation by up to 60% over unitigs and 27% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 91% over previous work. Finally we show that a small representation has advantages in downstream applications, as it speeds up queries on the popular kmer indexing tool Bifrost by 1.66× over unitigs and 1.29× over previous work. Availability: https://github.com/algbio/matchtigs


“…We show that our approach is ideal for pseudoaligning Nanopore long-read sequencing data, where the previous methods struggle, while simultaneously achieving rapid query times and small index size. Our implementation also provides an efficient way to construct the index making use of recent advances on colored unitig extraction algorithms (Cracco and Tomescu, 2022) and is an order of magnitude faster than Bifrost and Metagraph for reference databases containing 100,000 or more bacterial genomes. These factors enable Themisto to leverage much larger databases than previous methods, thus representing a significant methodological advance in pseudoalignment.…”

Section: Introduction (mentioning)

confidence: 99%

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes

Alanko, Vuohtoniemi, Mäklin, et al. 2023. Preprint.

Motivation: Huge data sets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. In order to efficiently make use of these data sets, efficient indexing data structures - that are both scalable and provide rapid query throughput - are paramount. Results: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 hours. The resulting index takes 142 gigabytes, and Themisto pseudoaligns reads from a Salmonella enterica isolate sample against the index at a rate of 2 million base pairs per second on 48 threads. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 thousand genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. Availability and implementation: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto available under the GPLv2 license.


Indexing All Life’s Known Biological Sequences

Karasikov, Mustafa, Danciu, et al. 2020. Preprint.

The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by an index and its query performance. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph indexes can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework's scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI's Sequence Read Archive, representing a total input of more than three petabases. MetaGraph provides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Notably, processing of data sets ranging from 1 TB of raw WGS reads to 20 TB of human RNA-sequencing data results in indexes whose memory footprints are small enough to host on standard desktop workstations. Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including indexes of over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 40,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries. 
All indexes will be available for download and in the cloud. In total, indexes comprising more than 1 million sequencing records are available for download. As an example of our indexes' integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transfected via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and use them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.
