Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs

Holley, Guillaume; Melsted, Páll

doi:10.1101/695338

Cited by 23 publications

(28 citation statements)

References 66 publications


“…In fact, for any concrete application, one might argue that a SPSS representation is too restrictive and can be improved. However, we chose to focus on SPSS representations because they are the common denominator in the applications of unitig-based representations we have observed [11, 15, 16, 17]. In this way, they retain broad applicability, as opposed to more specialized representations.…”

Section: Results (mentioning)

confidence: 99%

“…We did not compare against the Bloom filter trie [41], which is fast but uses an order of magnitude more memory than BOSS [40]. Other data structures, such as Pufferfish [15], blight [16], and Bifrost [17], implement more sophisticated operations and hence use significantly more memory than BOSS. Moreover, these make use of a unitig SPSS representation and hence could potentially themselves incorporate the UST approach.…”

Section: Evaluation of UST-FM (mentioning)

confidence: 99%

“…one that can efficiently determine if a k-mer belongs to K or not). In particular, Unitigs-FM [11] and deGSM [14] uses the FM-index as the auxiliary index, Pufferfish [15] and BLight [16] uses a minimum perfect hash function, and Bifrost [17] uses a minimizer hash table. Alternatively, U can be compressed to obtain a compressed disk representation of K, albeit without efficient support for membership queries prior to decompression.…”

Section: Introduction (mentioning)

confidence: 99%
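
The statement above describes pairing a unitig set U with an auxiliary index so that k-mer membership can be answered quickly. As a rough illustration of the minimizer-keyed variant, the following Python sketch maps each minimizer to the unitigs that contain a k-mer with that minimizer. It is a toy under stated assumptions, not Bifrost's implementation: the function names `minimizer`, `build_index`, and `contains` are hypothetical, the minimizer is taken to be the lexicographically smallest m-mer, and reverse complements are ignored.

```python
from collections import defaultdict

def minimizer(kmer, m):
    """Lexicographically smallest length-m substring (an assumed ordering)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def build_index(unitigs, k, m):
    """Map each minimizer to the ids of unitigs holding a k-mer with that minimizer."""
    index = defaultdict(set)
    for uid, unitig in enumerate(unitigs):
        for pos in range(len(unitig) - k + 1):
            index[minimizer(unitig[pos:pos + k], m)].add(uid)
    return index

def contains(kmer, unitigs, index, m):
    """Check membership by scanning only the unitigs that share the k-mer's minimizer."""
    for uid in index.get(minimizer(kmer, m), ()):
        if kmer in unitigs[uid]:
            return True
    return False

# Toy unitig set for k = 3 (forward strand only, an assumption of this sketch).
unitigs = ["ACG", "CGTA", "CGA"]
index = build_index(unitigs, k=3, m=2)
print(contains("GTA", unitigs, index, m=2))
print(contains("TAC", unitigs, index, m=2))
```

The point of the design is that a query touches only the unitigs sharing its minimizer rather than the whole set; a production index would additionally hash minimizers, store positions, and handle canonical k-mers.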


Representation of k-mer sets using spectrum-preserving string sets

Rahman

Medvedev

2020

Preprint


Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which we show improves index size by 10-44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
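
To make the SPSS idea in this abstract concrete, here is a minimal Python sketch that checks whether a set of strings preserves a k-mer spectrum and compares the total character count of three representations. The helper names `is_spss` and `weight` are hypothetical, the toy k-mer set and its unitigs were derived by hand, and the sketch treats k-mers on the forward strand only; it does not implement the UST algorithm itself.

```python
def is_spss(strings, kmer_set, k):
    """True if the strings contain exactly the k-mers in kmer_set, each once."""
    seen = []
    for s in strings:
        if len(s) < k:
            return False
        seen.extend(s[i:i + k] for i in range(len(s) - k + 1))
    return len(seen) == len(set(seen)) and set(seen) == set(kmer_set)

def weight(strings):
    """Total number of characters across all strings."""
    return sum(len(s) for s in strings)

k = 3
K = {"ACG", "CGT", "GTA", "CGA"}

# Unitigs: ACG branches to CGT and CGA, so only CGT -> GTA can be compacted.
unitigs = ["ACG", "CGTA", "CGA"]
# A path cover of the same graph: ACG -> CGT -> GTA, plus CGA on its own.
path_cover = ["ACGTA", "CGA"]

for name, rep in [("raw k-mers", sorted(K)), ("unitigs", unitigs), ("path cover", path_cover)]:
    print(name, is_spss(rep, K, k), weight(rep))
```

All three representations pass the spectrum check, but the path cover needs fewer characters than the unitigs because it is allowed to continue through the branching node.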


“…kallisto have a heuristic based on the dBG to avoid looking up every k-mer, which we have not reimplemented as of now. We have also experimented with another library to build the dBG called bifrost (Holley, 2019), which is slightly faster, presumably because of the rolling hash they use to lookup k-mers. The built dBGs were essentially the same and therefore the classifying performance of Brume was unchanged.…”

Section: Discussion (mentioning)

confidence: 99%

Embedding the de Bruijn graph, and applications to metagenomics

Menegaux

Vert

2020

Preprint


Fast mapping of sequencing reads to taxonomic clades is a crucial step in metagenomics, which however raises computational challenges as the numbers of reads and of taxonomic clades increases. Besides alignment-based methods, which are accurate but computational costly, faster compositional approaches have recently been proposed to predict the taxonomic clade of a read based on the set of k-mers it contains. Machine learning-based compositional approaches, in particular, have recently reached accuracies similar to alignment-based models, while being considerably faster. It has been observed that the accuracy of these models increases with the length k of the k-mers they use, however existing methods are limited to handle k-mers of lengths up to k = 12 or 13 because of their large memory footprint needed to store the model coefficients for each possible k-mer. In order to explore the performance of machine learning-based compositional approaches for longer k-mers than currently possible, we propose to reduce the memory footprint of these methods by binning together k-mers that appear together in the sequencing reads used to train the models. We achieve this binning by learning a vector embedding for the vertices of a compacted de Bruijn graph, allowing us to embed any DNA sequence in a low-dimensional vector space where a machine learning system can be trained. The resulting method, which we call Brume, allows us to train compositional machine learning-based models with k-mers of length up to k = 31. We show on two metagenomics benchmark that Brume reaches better performance than previously achieved, thanks to the use of longer k-mers.


“…To that extent, we implemented a user-friendly library along with different snippets to allow our method to be usable in practical cases. The challenge of indexing colored de Bruijn graphs [36] (or more generally to answer large sequence search problems as defined in [10]) have caught the interest of a community and could be a direct application of this work. As an example, BLight is successfully integrated as an indexing structure in REINDEER [34], a k-mer data structure that enables the quantification of query sequences in thousands of raw read samples.…”

Section: Discussion (mentioning)

confidence: 99%

Efficient exact associative structure for sequencing data

Marchet

Kerbiriou

Limasset

2019

Preprint


Motivation: A plethora of methods and applications share the fundamental need to associate information to words for high throughput sequence analysis. Indexing billions of k-mers is promptly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of the properties of the k-mer sets to leverage this challenge. They exploit the overlaps shared among k-mers by using a de Bruijn graph as a compact k-mer set to provide lightweight structures. Results: We present Blight, a static and exact index structure able to associate unique identifiers to indexed k-mers and to reject alien k-mers that scales to the largest kmer sets with a low memory cost. The proposed index combines an extremely compact representation along with very high throughput. Besides, its construction from the de Bruijn graph sequences is efficient and does not need supplementary memory. The efficient index implementation achieves to index the k-mers from the human genome with 8GB within 10 minutes and can scale up to the large axolotl genome with 63 GB within 76 minutes. Furthermore, while being memory efficient, the index allows above a million queries per second on a single CPU in our experiments, and the use of multiple cores raises its throughput. Finally, we also present how the index can practically represent metagenomic and transcriptomic sequencing data to highlight its wide applicative range. Availability: The index is implemented as a C++ library, is open source under AGPL3 license, and available at github.com/Malfoy/Blight. It is designed as a user-friendly library and comes along with samples code usage.
