AllSome Sequence Bloom Trees

Sun, Chen; Harris, Robert S.; Chikhi, Rayan; Medvedev, Paul

doi:10.1089/cmb.2017.0258

Cited by 32 publications

(22 citation statements)

References 38 publications

“…The resulting structure is usually referred to as a colored de Bruijn graph [19] and its representations have been widely studied ([50-61]). Even though we touched this setting in the section Multiple pan-genomes, exploiting the similarity between individual de Bruijn graphs for further compression in simplitig-based approaches is to be addressed in future work.…”

Section: Discussion (mentioning)

confidence: 99%

Simplitigs as an efficient and scalable representation of de Bruijn graphs

Břinda

Baym

Kucherov

2020

Preprint

Motivation: De Bruijn graphs play an essential role in computational biology, facilitating rapid alignment-free comparison of genomic datasets as well as reconstruction of underlying genomic sequences. Subsequently, an important question is how to efficiently represent, compress, and transmit de Bruijn graphs of the most common types of genomic data sets, such as sequencing reads, genomes, and pan-genomes.

Results: We introduce simplitigs, an effective representation of de Bruijn graphs for alignment-free applications. Simplitigs are a generalization of unitigs and correspond to spellings of vertex-disjoint paths in a de Bruijn graph. We present an easy-to-plug-in greedy heuristic for their computation and provide a reference implementation in a program called ProphAsm. We use ProphAsm to compare the scaling of simplitigs and unitigs on a range of genomic datasets. We demonstrate that simplitigs are superior to unitigs in terms of the cumulative sequence length as well as of the number of sequences, and that they are sufficiently close to the theoretical bounds for practical applications. Finally, we demonstrate that, when combined with standard full-text indexes, simplitigs provide a scalable solution for k-mer search in pan-genomes.

Availability: ProphAsm is written in C++ and is available under the MIT license from ...

De Bruijn graphs belong to the most popular graph representations of genomic datasets. They are defined as directed graphs G = (V, E) where V is the set of all k-mers (i.e., subwords of a fixed length k) occurring in the dataset, with edges connecting a vertex v to a vertex w if there is a (k − 1)-long prefix-suffix overlap between v and w. As follows from the definition, we can associate a de Bruijn graph with the underlying k-mer set and edges can be defined implicitly (unlike the edge-centric definition where k-mer sets are associated with edges [5]). In this paper, we consider only vertex-centric graphs.

De Bruijn graphs feature remarkable properties. First, their computation from data is easy and deterministic. Algorithms for enumerating and counting k-mers have been extensively studied and many programs are available [6][7][8][9]. If the datasets contain sequencing errors, the computation may also involve graph cleaning. This aims at removing those k-mers that are the result of sequencing errors and are due to their supposed randomness expected to be rare. Second, if k is chosen appropriately, de Bruijn graphs can capture substantial information about the entire molecules under sequencing as these correspond to (some of the) walks in the graphs, provided that sequencing was sufficiently deep. Third, de Bruijn graphs can be handled easily, which simplifies software development as well as dataset analysis and interpretation. These properties have led to a large variety of applications of de Bruijn graphs.

De Bruijn graphs have been widely studied in the context of sequence assembly [10][11][12]. Here, their construction is typically the first step to the reconstruction of the genomes and transcr...
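The vertex-centric definition and the greedy idea behind simplitigs, as described in the abstract above, can be sketched roughly in Python. This is an illustration under stated assumptions, not ProphAsm's implementation: all function names are hypothetical, reverse complements are ignored, and the choice of the smallest remaining k-mer as seed is arbitrary.

```python
def kmers(seq, k):
    """Return all k-mers (substrings of length k) of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def successors(v, vertex_set):
    """Implicit edges v -> w: the (k-1)-suffix of v equals the (k-1)-prefix of w."""
    return [v[1:] + c for c in "ACGT" if v[1:] + c in vertex_set]


def predecessors(v, vertex_set):
    """Implicit edges u -> v: the (k-1)-suffix of u equals the (k-1)-prefix of v."""
    return [c + v[:-1] for c in "ACGT" if c + v[:-1] in vertex_set]


def greedy_simplitigs(vertex_set):
    """Cover every k-mer exactly once by spellings of vertex-disjoint paths."""
    if not vertex_set:
        return []
    k = len(next(iter(vertex_set)))
    remaining = set(vertex_set)
    simplitigs = []
    while remaining:
        seed = min(remaining)  # arbitrary but deterministic seed (assumption)
        remaining.remove(seed)
        path = seed
        # Extend forward while an unused successor k-mer exists.
        while True:
            nxt = successors(path[-k:], remaining)
            if not nxt:
                break
            remaining.remove(nxt[0])
            path += nxt[0][-1]
        # Extend backward while an unused predecessor k-mer exists.
        while True:
            prev = predecessors(path[:k], remaining)
            if not prev:
                break
            remaining.remove(prev[0])
            path = prev[0][0] + path
        simplitigs.append(path)
    return simplitigs


reads = ["ACGTAC", "GTACGG"]
vertex_set = {km for read in reads for km in kmers(read, 3)}
print(greedy_simplitigs(vertex_set))  # ['CGTACGG']
```

Only the k-mer set is stored; edges are recomputed on demand from the (k − 1)-overlap rule, which is what makes the graph representation implicit. Looking up neighbours in `remaining` rather than in the full set is what keeps the paths vertex-disjoint.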

“…Though inspired by the SBT and subsequent work, Mantis takes a completely different approach to this problem. Specifically, rather than adopting a hierarchy of Bloom filters, as suggested by previous approaches (Solomon and Kingsford, 2016, 2017; Sun et al, 2017), we build our system on top of the CQF (Pandey et al, 2017b), using this data structure both for counting and as a general key-value store. We combine this data structure with a color-encoding scheme similar to that adopted by Holley et al (2016) and Almodaresi et al (2017) for colored de Bruijn graph representation.…”

Section: Discussion (mentioning)

confidence: 99%

“…The resulting problem is coined as the experiment discovery problem, where the goal is to return all experiments that contain at least some user-defined q fraction of the k-mers present in the query string. The space and query time of the SBT structure has been further improved by Solomon and Kingsford (2017) and Sun et al (2017) by applying an All-Some set decomposition over the original sets of the SBT structure. This seminal work introduced both a formulation of this problem and the initial steps toward a solution.…”

Section: Introduction (mentioning)

confidence: 99%

Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index

et al. 2018

Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6-108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.
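The experiment discovery problem quoted above (return every experiment containing at least a fraction q of the query's k-mers) and the idea of an all/some split over a tree of experiments can be sketched as follows. This is a simplified illustration under assumptions, not the published SBT, AllSome SBT, or Mantis code: exact Python sets stand in for Bloom filters, the tree shape is hand-built, and every name (`Node`, `build`, `tree_hits`, and so on) is hypothetical. Each node keeps `all` (k-mers shared by every experiment below it and not already recorded at an ancestor) and `some` (k-mers present in at least one but not all experiments below it).

```python
from math import ceil


def kmer_set(seq, k):
    """Set of all k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def brute_force_hits(experiments, query, q):
    """Reference answer: scan every experiment independently."""
    need = ceil(q * len(query))
    return sorted(name for name, ks in experiments.items()
                  if len(query & ks) >= need)


class Node:
    def __init__(self, name=None, children=()):
        self.name = name              # experiment name (leaves only)
        self.children = list(children)
        self.all = set()
        self.some = set()

    def leaf_names(self):
        if not self.children:
            return [self.name]
        return [n for c in self.children for n in c.leaf_names()]


def build(node, experiments, parent_common=frozenset()):
    """Fill in the all/some sets top-down."""
    sets = [experiments[n] for n in node.leaf_names()]
    common = set.intersection(*sets)
    node.all = common - parent_common
    node.some = set.union(*sets) - common
    for child in node.children:
        build(child, experiments, common)


def tree_hits(node, pending, need, found=0):
    """Query with early acceptance (via all) and pruning (via all + some)."""
    found += len(pending & node.all)
    if found >= need:                 # every experiment below is a hit
        return node.leaf_names()
    if found + len(pending & node.some) < need:
        return []                     # no experiment below can reach the threshold
    rest = pending - node.all         # k-mers already counted need not be re-checked
    return [n for c in node.children for n in tree_hits(c, rest, need, found)]


experiments = {
    "exp1": kmer_set("ACGTACGT", 3),
    "exp2": kmer_set("ACGTTTGA", 3),
    "exp3": kmer_set("GGGCCCAA", 3),
}
root = Node(children=[Node(children=[Node("exp1"), Node("exp2")]), Node("exp3")])
build(root, experiments)
query = kmer_set("ACGTA", 3)
print(sorted(tree_hits(root, query, ceil(0.6 * len(query)))))  # ['exp1', 'exp2']
```

With exact sets the tree query returns the same experiments as the brute-force scan; with Bloom filters in place of sets, false positives become possible, which is the limitation the Mantis abstract refers to.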

“…Those downsides are intensified in the colored de Bruijn graph for which the memory consumption of colors rapidly overtakes the vertices and edges memory usage [36]. For this reason, a lot of attention has been given to succinct data structures for building the colored de Bruijn graph [30, 31, 36-41] and data structures for multi-set k-mer indexing [42-47]. In the following, we focus on tools for constructing compacted de Bruijn graphs (cdBGs) with or without colors.…”

Section: Introduction (mentioning)

confidence: 99%

Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs

2020

Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in. Availability https://github.com/pmelsted/bifrost
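The two notions in this abstract, compacting non-branching paths into single vertices and coloring each k-mer with the genomes it occurs in, can be illustrated with a small Python sketch. It is not Bifrost's algorithm or API: it builds the full uncompacted k-mer set in memory (exactly what Bifrost is designed to avoid), ignores reverse complements, and uses hypothetical function names throughout.

```python
def kmer_set(seq, k):
    """Set of all k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def color_map(genomes, k):
    """Map each k-mer to the set of genome names it occurs in."""
    colors = {}
    for name, seq in genomes.items():
        for km in kmer_set(seq, k):
            colors.setdefault(km, set()).add(name)
    return colors


def compact(vertex_set):
    """Merge maximal non-branching paths of the de Bruijn graph into unitigs."""
    def succ(v):
        return [v[1:] + c for c in "ACGT" if v[1:] + c in vertex_set]

    def pred(v):
        return [c + v[:-1] for c in "ACGT" if c + v[:-1] in vertex_set]

    def next_in_unitig(v):
        # v -> w can be merged only if w is v's sole successor
        # and v is w's sole predecessor.
        s = succ(v)
        if len(s) == 1 and s[0] != v and len(pred(s[0])) == 1:
            return s[0]
        return None

    def has_unitig_pred(v):
        p = pred(v)
        return len(p) == 1 and p[0] != v and len(succ(p[0])) == 1

    starts = sorted(v for v in vertex_set if not has_unitig_pred(v))
    unitigs, seen = [], set()
    # The second pass over all k-mers picks up isolated cycles, which have no start.
    for start in starts + sorted(vertex_set):
        if start in seen:
            continue
        seen.add(start)
        path, v = start, start
        while True:
            w = next_in_unitig(v)
            if w is None or w in seen:
                break
            seen.add(w)
            path += w[-1]
            v = w
        unitigs.append(path)
    return unitigs


genomes = {"g1": "ACGTACG", "g2": "TACGG"}
colors = color_map(genomes, 3)
print(compact(set(colors)))    # ['CGG', 'CGTACG']
print(sorted(colors["ACG"]))   # ['g1', 'g2']
```

Here the k-mer ACG has two successors (CGG and CGT), so it ends a unitig; the colors are kept per k-mer, so a unitig may mix k-mers that occur in different subsets of genomes.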
