Graphical pan-genome analysis with compressed suffix trees and the Burrows–Wheeler transform

Baier, Uwe; Beller, Timo; Ohlebusch, Enno

doi:10.1093/bioinformatics/btv603

Cited by 67 publications

(85 citation statements)

References 17 publications


“…Also, retrieving Single Nucleotide Polymorphisms (SNP) data-detecting and localizing one base mutations on genomic data [160]. Other application is aligning DNA sequences using compression [161]; Using data compression to detect large transformations between the DNA of different individuals or species, also known as rearrangement detection, has also been shown to work efficiently [162]; It has also been used for efficient storage of data structures in pan-genome analysis, namely using de Bruijn graphs [163,164]. Here, the problem is to deal with large amounts of information and its fast retrieval.…”

Section: Discussion (mentioning)

confidence: 99%

A Survey on Data Compression Methods for Biological Sequences

2016


Abstract: The ever increasing growth of the production of high-throughput sequencing data poses a serious challenge to the storage, processing and transmission of these data. As frequently stated, it is a data deluge. Compression is essential to address this challenge: it reduces storage space and processing costs, along with speeding up data transmission. In this paper, we provide a comprehensive survey of existing compression approaches, that are specialized for biological data, including protein and DNA sequences. Also, we devote an important part of the paper to the approaches proposed for the compression of different file formats, such as FASTA, as well as FASTQ and SAM/BAM, which contain quality scores and metadata, in addition to the biological sequences. Then, we present a comparison of the performance of several methods, in terms of compression ratio, memory usage and compression/decompression time. Finally, we present some suggestions for future research on biological data compression.


“…The C-DBG is directly constructed in a compressed way, where a non-branching path is stored in a single vertex, using an augmented suffix tree. Baier et al [20] improved SplitMEM in theory and practice with two algorithms that use the BWT and a compressed suffix tree. Unfortunately, both tools use more memory than the original size of the input sequences.…”

Section: Existing Approaches (mentioning)

confidence: 99%

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Holley, Wittler, Stoye

2016

Algorithms Mol Biol


Background: High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. Results: In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory. Conclusion: We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure. Availability: https://www.github.com/GuillaumeHolley/BloomFilterTrie.
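The notion of a "colored k-mer" used in this abstract can be illustrated with a minimal sketch. It is not the Bloom Filter Trie itself: it uses a plain dictionary, ignores reverse complements, and the function name `colored_kmers` and the toy genomes are hypothetical.

```python
from collections import defaultdict


def colored_kmers(genomes, k):
    """Map every k-mer to the set of genome names ("colors") that contain it.

    `genomes` is a dict {name: sequence}. Illustrative only: a real
    pan-genome index would use a compressed structure, not a dict.
    """
    colors = defaultdict(set)
    for name, seq in genomes.items():
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


# Toy input (made up for illustration).
genomes = {"g1": "ACGTA", "g2": "CGTAC"}
kmer_colors = colored_kmers(genomes, 3)
```

Here `kmer_colors["CGT"]` holds both genome names because the 3-mer occurs in both toy sequences, while `kmer_colors["ACG"]` holds only `"g1"`.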


“…Despite serving as a building block for many methods in computational biology, the de Bruijn graph adoption is hindered by two factors. First, the memory usage and computational requirements for building de Bruijn graphs from raw sequencing reads are considerable compared to alignment to a reference genome while only a handful of tools have focused on de Bruijn graph compaction (Minkin et al, 2016;Chikhi et al, 2016;Marcus et al, 2014;Baier et al, 2016;Minkin et al, 2013). Second, de Bruijn graph construction usually requires tight integration with the code.…”

Section: Introduction (mentioning)

confidence: 99%

Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs

Holley, Melsted

2019

Preprint


Motivation: De Bruijn graphs are the core data structure for a wide range of assemblers and genome analysis software processing High Throughput Sequencing datasets. For population genomic analysis, the colored de Bruijn graph is often used in order to take advantage of the massive sets of sequenced genomes available for each species. However, memory consumption of tools based on the de Bruijn graph is often prohibitive, due to the high number of vertices, edges or colors in the graph. In order to process large and complex genomes, most short-read assemblers based on the de Bruijn graph paradigm reduce the assembly complexity and memory usage by compacting first all maximal non-branching paths of the graph into single vertices. Yet, de Bruijn graph compaction is challenging as it requires the uncompacted de Bruijn graph to be available in memory. Results: We present a new parallel and memory efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted de Bruijn graph. Bifrost features a broad range of functions such as sequence querying, storage of user data alongside vertices and graph editing that automatically preserve the compaction property. Bifrost makes full use of the dynamic index efficiency and proposes a graph coloring method efficiently mapping each k-mer of the graph to the set of genomes in which it occurs. Experimental results show that our algorithm is competitive with state-of-the-art de Bruijn graph compaction and coloring tools. Bifrost was able to build the colored and compacted de Bruijn graph of about 118,000 Salmonella genomes on a mid-class server in about 4 days using 103 GB of main memory. Availability: https://github.com
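Compacting maximal non-branching paths into single vertices, as described in this abstract, can be sketched naively as follows. This is the simple approach that first holds the uncompacted k-mer set in memory, which is exactly what Bifrost (and the BWT and compressed-suffix-tree algorithms of Baier et al.) aim to avoid. It assumes a node-centric graph over the alphabet ACGT and ignores reverse complements; `compact_de_bruijn` and the toy k-mers are hypothetical.

```python
def compact_de_bruijn(kmers):
    """Merge maximal non-branching paths of a node-centric de Bruijn graph.

    Nodes are the given k-mers; u -> v is an edge when the (k-1)-suffix of u
    equals the (k-1)-prefix of v. Returns one string per compacted vertex.
    """
    kmers = set(kmers)
    alphabet = "ACGT"

    def successors(u):
        return [u[1:] + c for c in alphabet if u[1:] + c in kmers]

    def predecessors(u):
        return [c + u[:-1] for c in alphabet if c + u[:-1] in kmers]

    def starts_path(u):
        # u can be merged into its predecessor only if that predecessor is
        # unique and has u as its only successor; otherwise u starts a path.
        preds = predecessors(u)
        return len(preds) != 1 or len(successors(preds[0])) != 1

    def extend(u, visited):
        # Walk forward while the current node has exactly one successor
        # and that successor has exactly one predecessor.
        path, cur = u, u
        visited.add(u)
        while True:
            succ = successors(cur)
            if len(succ) != 1:
                break
            nxt = succ[0]
            if len(predecessors(nxt)) != 1 or nxt in visited:
                break
            path += nxt[-1]
            visited.add(nxt)
            cur = nxt
        return path

    visited = set()
    unitigs = [extend(u, visited) for u in sorted(kmers) if starts_path(u)]
    # Any k-mers left over lie on isolated cycles with no branching node.
    for u in sorted(kmers):
        if u not in visited:
            unitigs.append(extend(u, visited))
    return unitigs


# Toy 3-mers from the made-up sequences ACGTT and ACGAA.
unitigs = compact_de_bruijn(["ACG", "CGT", "GTT", "CGA", "GAA"])
```

In this toy input `ACG` has two successors, so it remains its own vertex, and each of the two branches collapses into a single compacted vertex.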


Graphical pan-genome analysis with compressed suffix trees and the Burrows–Wheeler transform

Abstract: https://www.uni-ulm.de/in/theo/research/seqana/.
