deGSM: Memory Scalable Construction Of Large Scale de Bruijn Graph

Guo, Hongzhe; Fu, Yilei; Gao, Yan; Li, Junyi; Wang, Yadong; Liu, Bo

doi:10.1109/tcbb.2019.2913932

Cited by 17 publications

(18 citation statements)

References 43 publications


“…As a downside, the naive implementation of the heuristic using a standard hashtable may run into memory issues. In our work, we have not encountered this, but memory consumption can be readily improved using more advanced data structures, similarly to what has been done for tools for unitig computation [33,46,47]. We note that ProphAsm is a spin-off of the ProPhyle software (https://prophyle.github.io/, [27]) for phylogeny-based metagenomic classification.…”

Section: Discussion (mentioning)

confidence: 96%

Simplitigs as an efficient and scalable representation of de Bruijn graphs

Břinda

Baym

Kucherov

2020

Preprint


Motivation: De Bruijn graphs play an essential role in computational biology, facilitating rapid alignment-free comparison of genomic datasets as well as reconstruction of underlying genomic sequences. Subsequently, an important question is how to efficiently represent, compress, and transmit de Bruijn graphs of the most common types of genomic data sets, such as sequencing reads, genomes, and pan-genomes.

Results: We introduce simplitigs, an effective representation of de Bruijn graphs for alignment-free applications. Simplitigs are a generalization of unitigs and correspond to spellings of vertex-disjoint paths in a de Bruijn graph. We present an easy-to-plug-in greedy heuristic for their computation and provide a reference implementation in a program called ProphAsm. We use ProphAsm to compare the scaling of simplitigs and unitigs on a range of genomic datasets. We demonstrate that simplitigs are superior to unitigs in terms of the cumulative sequence length as well as of the number of sequences, and that they are sufficiently close to the theoretical bounds for practical applications. Finally, we demonstrate that, when combined with standard full-text indexes, simplitigs provide a scalable solution for k-mer search in pan-genomes.

Availability: ProphAsm is written in C++ and is available under the MIT license.

De Bruijn graphs belong to the most popular graph representations of genomic datasets. They are defined as directed graphs G = (V, E), where V is the set of all k-mers (i.e., subwords of a fixed length k) occurring in the dataset, with edges connecting a vertex v to a vertex w if there is a (k − 1)-long prefix-suffix overlap between v and w. As follows from the definition, we can associate a de Bruijn graph with the underlying k-mer set and edges can be defined implicitly (unlike the edge-centric definition where k-mer sets are associated with edges [5]). In this paper, we consider only vertex-centric graphs.

De Bruijn graphs feature remarkable properties. First, their computation from data is easy and deterministic. Algorithms for enumerating and counting k-mers have been extensively studied and many programs are available [6][7][8][9]. If the datasets contain sequencing errors, the computation may also involve graph cleaning. This aims at removing those k-mers that result from sequencing errors and that, owing to their supposed randomness, are expected to be rare. Second, if k is chosen appropriately, de Bruijn graphs can capture substantial information about the entire molecules under sequencing as these correspond to (some of the) walks in the graphs, provided that sequencing was sufficiently deep. Third, de Bruijn graphs can be handled easily, which simplifies software development as well as dataset analysis and interpretation. These properties have led to a large variety of applications of de Bruijn graphs.

De Bruijn graphs have been widely studied in the context of sequence assembly [10][11][12]. Here, their construction is typically the first step to the reconstruction of the genomes and transcr...
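The greedy heuristic mentioned in the abstract can be illustrated with a minimal Python sketch. It is not ProphAsm's implementation: it assumes a unidirectional, vertex-centric graph (reverse complements are ignored), keeps the k-mers in a plain Python set (the "standard hashtable" setting), and the names `kmers` and `greedy_simplitigs` are hypothetical. Each simplitig is grown from an arbitrary seed k-mer, first to the right and then to the left, removing every k-mer it absorbs so that the resulting paths are vertex-disjoint.

```python
def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def greedy_simplitigs(kmer_set, k, alphabet="ACGT"):
    """Greedily cover a k-mer set with vertex-disjoint paths and spell them.

    Assumption of this sketch: strands are ignored, so a k-mer and its
    reverse complement are treated as unrelated vertices.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    remaining = set(kmer_set)  # hash set of k-mers not yet covered
    simplitigs = []
    while remaining:
        path = remaining.pop()  # arbitrary seed k-mer
        # Extend to the right while some successor k-mer is still uncovered.
        extended = True
        while extended:
            extended = False
            suffix = path[-(k - 1):]
            for c in alphabet:
                if suffix + c in remaining:
                    remaining.remove(suffix + c)
                    path += c
                    extended = True
                    break
        # Extend to the left in the same way.
        extended = True
        while extended:
            extended = False
            prefix = path[:k - 1]
            for c in alphabet:
                if c + prefix in remaining:
                    remaining.remove(c + prefix)
                    path = c + path
                    extended = True
                    break
        simplitigs.append(path)
    return simplitigs
```

Because the seed is arbitrary, the exact simplitigs can differ between runs; what stays invariant is that the k-mers spelled by the output are exactly the input set, each occurring once.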



“…Note that BCALM2 can process assembled genomes as well as short read data. deGSM [50] performs an external sorting of the k-mers from the input sequences and then constructs a Burrows-Wheeler transform (BWT) [51] of the unitigs from which the final graph is extracted. SplitMEM [30] uses the suffix tree [52] to construct a ccdBG.…”

Section: Introduction (mentioning)

confidence: 99%

Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs

2020


Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in. Availability: https://github.com/pmelsted/bifrost


“…The maximal unitigs U can be computed efficiently [12][13][14] and combined with an auxiliary index to obtain a membership data structure (i.e. one that can efficiently determine if a k-mer belongs to K or not).…”

Section: Introduction (mentioning)

confidence: 99%

“…one that can efficiently determine if a k-mer belongs to K or not). In particular, Unitigs-FM [11] and deGSM [14] uses the FM-index as the auxiliary index, Pufferfish [15] and BLight [16] uses a minimum perfect hash function, and Bifrost [17] uses a minimizer hash table. Alternatively, U can be compressed to obtain a compressed disk representation of K, albeit without efficient support for membership queries prior to decompression.…”

Section: Introduction (mentioning)

confidence: 99%

Representation of k-mer sets using spectrum-preserving string sets

Rahman

Medvedev

2020

Preprint


Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which we show improves index size by 10-44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
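To make the SPSS idea concrete, here is a small Python sketch of a checker. The abstract does not spell out the definition, so the conditions used here are an assumption of this sketch: every string has length at least k, the k-mers occurring in the strings are exactly the target set, and each occurs once. Reverse complements are ignored, and the names `spectrum`, `is_spss`, and `weight` are hypothetical.

```python
from collections import Counter


def spectrum(strings, k):
    """Count how often each k-mer occurs across the given strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def is_spss(strings, kmer_set, k):
    """Check the assumed SPSS conditions for `strings` against `kmer_set`."""
    counts = spectrum(strings, k)
    return (
        all(len(s) >= k for s in strings)          # no string shorter than k
        and set(counts) == set(kmer_set)           # same k-mers, nothing extra
        and all(c == 1 for c in counts.values())   # each k-mer spelled once
    )


def weight(strings):
    """Total number of characters, the size being minimized."""
    return sum(len(s) for s in strings)
```

For K = {ACG, CGT, GTA}, the k-mers listed one per string pass the check with weight 9, while the single string ACGTA also passes with weight 5, which is the kind of saving a smaller representation aims for.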

