Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Holley, Guillaume; Wittler, Roland; Stoye, Jens

doi:10.1186/s13015-016-0066-8

Cited by 75 publications

(64 citation statements)

References 21 publications


“…The resulting structure is usually referred to as a colored de Bruijn graph [19] and its representations have been widely studied ([50][51][52][53][54][55][56][57][58][59][60][61]). Even though we touched this setting in the section Multiple pan-genomes, exploiting the similarity between individual de Bruijn graphs for further compression in simplitig-based approaches is to be addressed in future work.…”

Section: Discussion (mentioning)

confidence: 99%

Simplitigs as an efficient and scalable representation of de Bruijn graphs

Břinda

Baym

Kucherov

2020

Preprint


Motivation: De Bruijn graphs play an essential role in computational biology, facilitating rapid alignment-free comparison of genomic datasets as well as reconstruction of underlying genomic sequences. Subsequently, an important question is how to efficiently represent, compress, and transmit de Bruijn graphs of the most common types of genomic data sets, such as sequencing reads, genomes, and pan-genomes.

Results: We introduce simplitigs, an effective representation of de Bruijn graphs for alignment-free applications. Simplitigs are a generalization of unitigs and correspond to spellings of vertex-disjoint paths in a de Bruijn graph. We present an easy-to-plug-in greedy heuristic for their computation and provide a reference implementation in a program called ProphAsm. We use ProphAsm to compare the scaling of simplitigs and unitigs on a range of genomic datasets. We demonstrate that simplitigs are superior to unitigs in terms of the cumulative sequence length as well as of the number of sequences, and that they are sufficiently close to the theoretical bounds for practical applications. Finally, we demonstrate that, when combined with standard full-text indexes, simplitigs provide a scalable solution for k-mer search in pan-genomes.

Availability: ProphAsm is written in C++ and is available under the MIT license from […]

De Bruijn graphs belong to the most popular graph representations of genomic datasets. They are defined as directed graphs G = (V, E), where V is the set of all k-mers (i.e., subwords of a fixed length k) occurring in the dataset, with edges connecting a vertex v to a vertex w if there is a (k − 1)-long prefix-suffix overlap between v and w. As follows from the definition, we can associate a de Bruijn graph with the underlying k-mer set and edges can be defined implicitly (unlike the edge-centric definition where k-mer sets are associated with edges [5]). In this paper, we consider only vertex-centric graphs.

De Bruijn graphs feature remarkable properties. First, their computation from data is easy and deterministic. Algorithms for enumerating and counting k-mers have been extensively studied and many programs are available [6][7][8][9]. If the datasets contain sequencing errors, the computation may also involve graph cleaning. This aims at removing those k-mers that are the result of sequencing errors and are due to their supposed randomness expected to be rare. Second, if k is chosen appropriately, de Bruijn graphs can capture substantial information about the entire molecules under sequencing as these correspond to (some of the) walks in the graphs, provided that sequencing was sufficiently deep. Third, de Bruijn graphs can be handled easily, which simplifies software development as well as dataset analysis and interpretation. These properties have led to a large variety of applications of de Bruijn graphs.

De Bruijn graphs have been widely studied in the context of sequence assembly [10][11][12]. Here, their construction is typically the first step to the reconstruction of the genomes and transcr...
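The vertex-centric definition and the idea of spelling vertex-disjoint paths can be illustrated with a minimal Python sketch. It is not ProphAsm: the function names are hypothetical, reverse complements are ignored, and the seed choice (`min`) is an arbitrary assumption made only so that the output is reproducible.

```python
def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def successors(kmer, kmer_set, alphabet="ACGT"):
    """Implicit out-edges: w follows v when v[1:] == w[:-1] and w is in the set."""
    return [kmer[1:] + c for c in alphabet if kmer[1:] + c in kmer_set]


def predecessors(kmer, kmer_set, alphabet="ACGT"):
    """Implicit in-edges: u precedes v when u[1:] == v[:-1] and u is in the set."""
    return [c + kmer[:-1] for c in alphabet if c + kmer[:-1] in kmer_set]


def greedy_simplitigs(kmer_set):
    """Greedily cover the graph with vertex-disjoint paths and spell each path."""
    remaining = set(kmer_set)
    simplitigs = []
    while remaining:
        seed = min(remaining)  # arbitrary but deterministic seed (assumption)
        remaining.remove(seed)
        k = len(seed)
        s = seed
        # Extend forward while the last k-mer has an unused successor.
        while True:
            nxt = successors(s[-k:], remaining)
            if not nxt:
                break
            remaining.remove(nxt[0])
            s += nxt[0][-1]
        # Extend backward while the first k-mer has an unused predecessor.
        while True:
            prev = predecessors(s[:k], remaining)
            if not prev:
                break
            remaining.remove(prev[0])
            s = prev[0][0] + s
        simplitigs.append(s)
    return simplitigs
```

For example, `greedy_simplitigs(kmers("ACGTTGCA", 3))` returns `["ACGTTGCA"]`: the six 3-mers form a single path, so one string of 8 characters replaces 18 characters of separately stored k-mers.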


“…The second is BOSS, which, as mentioned previously, was shown [35] to have superior space usage. We did not compare against the Bloom filter trie [36], which is fast but uses an order of […] [Table 2: Space usage of UST-Compress and others. We show the average number of bits per distinct k-mer in the dataset.]…”

Section: Evaluation of UST-FM (mentioning)

confidence: 99%

“…Membership data structures for k-mer sets were surveyed in a recent paper [9]. In addition to the unitig-based approaches already mentioned, other exact representations include succinct de Bruijn graphs (referred to as BOSS [36]) and their variations [37,38], dynamic de Bruijn graphs [39,40], and Bloom filter tries [41]. Some data structures are non-static, i.e.…”

Section: Related Work (mentioning)

confidence: 99%

“…The second is BOSS, which, as mentioned previously, was shown [40] to have superior space usage. We did not compare against the Bloom filter trie [41], which is fast but uses an order of magnitude more memory than BOSS [40]. Other data structures, such as Pufferfish [15], blight [16], and Bifrost [17], implement more sophisticated operations and hence use significantly more memory than BOSS.…”

Section: Evaluation of UST-FM (mentioning)

confidence: 99%


Representation of k-mer sets using spectrum-preserving string sets

Rahman

Medvedev

2020

Preprint


Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which we show improves index size by 10-44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
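A small Python sketch can make the SPSS idea concrete. It assumes, as a working definition not spelled out in the abstract above, that a string set is spectrum-preserving when every string has length at least k and the k-mers of the strings are exactly the target set, each occurring once. The function names are hypothetical, reverse complements are ignored, and this is not the UST algorithm itself.

```python
from collections import Counter


def spectrum_counts(strings, k):
    """Count every k-mer occurrence across all strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def is_spss(strings, kmer_set, k):
    """Check the assumed SPSS property: same k-mer set, no k-mer repeated."""
    if any(len(s) < k for s in strings):
        return False
    counts = spectrum_counts(strings, k)
    return set(counts) == set(kmer_set) and all(c == 1 for c in counts.values())


def weight(strings):
    """Size of a representation, measured as the total number of characters."""
    return sum(len(s) for s in strings)
```

With `K = {"ACG", "CGT", "GTT"}`, both `["ACG", "CGT", "GTT"]` and `["ACGTT"]` pass `is_spss(..., K, 3)`, but their weights are 9 and 5 characters respectively, which is the kind of saving a smaller representation aims for.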


“…Hence, this allows the matrix to be compressed and stored independently of the graph. Also several other methods have been developed to further compress, store, and manipulate the color matrix, including Rainbowfish (Almodaresi et al, 2017), Mantis (Pandey et al, 2018), Bloom Filter Trie (BFT) (Holley et al, 2016), and Bifrost (Holley and Melsted, 2019).…”

Section: Introduction (mentioning)

confidence: 99%

Succinct Dynamic de Bruijn Graphs

Alipanahi

Kuhnle

Puglisi

et al.

2020

Preprint


Motivation: The de Bruijn graph is one of the fundamental data structures for analysis of high-throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction, e.g., to add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work in building the graph in an efficient and mutable manner. Hence, most space-efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. Results: In this paper we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g., greater than 15 billion k-mers). Competing dynamic methods, e.g., FDBG (Crawford et al., 2018), cannot be constructed on large-scale datasets, or cannot support both addition and deletion, e.g., BiFrost

