Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Benoit, Gaëtan; Lemaitre, Claire; Lavenier, Dominique; Drézen, Erwan; Dayris, Thibault; Uricaru, Raluca; Rizk, Guillaume

doi:10.1186/s12859-015-0709-7

Cited by 91 publications

(74 citation statements)

References 32 publications


“…In 1993 the first specialized DNA compressor was proposed (Grumbach and Tahi, 1993). Since then, numerous DNA compressors were developed (e.g., Cao et al, 2007, Li et al, 2013, Benoit et al, 2015, Al-Okaily et al, 2017). In our experience only two compressors pass the practicality threshold: DELIMINATE (Mohammed et al, 2012) and MFCompress (Pinho and Pratas, 2014).…”

Section: Introduction (mentioning)

confidence: 88%

Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

Kryukov, Ueda, Nakagawa, et al. (2018)

Preprint


DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA and FASTQ-formatted nucleotide sequences. NAF compression ratio is comparable to the best DNA compressors, while providing dramatically faster decompression. We compared our format with DNA compressors: DELIMINATE and MFCompress, and with general purpose compressors: gzip, bzip2, xz, brotli, and zstd. Availability: NAF compressor and decompressor, as well as format specification are available at https://github.com/KirillKryukov/naf. Format specification is in public domain. Compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.



“…Although SMS (Single Molecule Sequencing) technologies (Rang et al, 2018; Rhoads and Au, 2015) have re-introduced the OLC framework as the method of choice to assemble long and erroneous reads (Koren et al, 2017; Li, 2016; Chin et al, 2016; Kamath et al, 2017), de Bruijn graph based methods are nonetheless used to assemble and correct long reads (Salmela and Rivals, 2014; Ruan and Li, 2019). Overall, the de Bruijn graphs have found widespread use for a variety of problems such as de novo transcriptome assembly (Robertson et al, 2010), variant calling (Uricaru et al, 2015), short read compression (Benoit et al, 2015), short read correction (Limasset et al, 2019), long read correction (Salmela and Rivals, 2014) and short read mapping (Liu et al, 2016) to name a few. The colored de Bruijn graph is a variant of the de Bruijn graph which keeps track of the source of each vertex in the graph (Iqbal et al, 2012).…”

Section: Introduction (mentioning)

confidence: 99%

Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs

Holley, Melsted (2019)

Preprint


Motivation: De Bruijn graphs are the core data structure for a wide range of assemblers and genome analysis software processing High Throughput Sequencing datasets. For population genomic analysis, the colored de Bruijn graph is often used in order to take advantage of the massive sets of sequenced genomes available for each species. However, memory consumption of tools based on the de Bruijn graph is often prohibitive, due to the high number of vertices, edges or colors in the graph. In order to process large and complex genomes, most short-read assemblers based on the de Bruijn graph paradigm reduce the assembly complexity and memory usage by compacting first all maximal non-branching paths of the graph into single vertices. Yet, de Bruijn graph compaction is challenging as it requires the uncompacted de Bruijn graph to be available in memory. Results: We present a new parallel and memory efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted de Bruijn graph. Bifrost features a broad range of functions such as sequence querying, storage of user data alongside vertices and graph editing that automatically preserve the compaction property. Bifrost makes full use of the dynamic index efficiency and proposes a graph coloring method efficiently mapping each k-mer of the graph to the set of genomes in which it occurs. Experimental results show that our algorithm is competitive with state-of-the-art de Bruijn graph compaction and coloring tools. Bifrost was able to build the colored and compacted de Bruijn graph of about 118,000 Salmonella genomes on a mid-class server in about 4 days using 103 GB of main memory. Availability: https://github.com
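
The compaction step this abstract refers to (merging every maximal non-branching path into a single vertex) can be illustrated with a short Python sketch. This is the naive approach that first holds the full k-mer set in memory, which is what Bifrost is designed to avoid, so it should not be read as Bifrost's algorithm. The names `kmers` and `compact` are hypothetical, and reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compact(kmer_set):
    """Merge maximal non-branching paths into unitigs (forward strand only).

    Assumption: nodes are k-mers and an edge u -> v exists when the last
    k-1 characters of u equal the first k-1 characters of v.
    """
    def successors(km):
        return [km[1:] + c for c in "ACGT" if km[1:] + c in kmer_set]

    def predecessors(km):
        return [c + km[:-1] for c in "ACGT" if c + km[:-1] in kmer_set]

    unitigs = []
    visited = set()
    for km in sorted(kmer_set):
        if km in visited:
            continue
        # Walk backward to the first k-mer of this non-branching path.
        start = km
        seen = {km}  # guards against looping forever on a cycle
        while True:
            preds = predecessors(start)
            if (len(preds) == 1 and successors(preds[0]) == [start]
                    and preds[0] not in seen):
                start = preds[0]
                seen.add(start)
            else:
                break
        # Walk forward, appending one character per merged k-mer.
        path = start
        cur = start
        visited.add(cur)
        while True:
            succs = successors(cur)
            if (len(succs) == 1 and predecessors(succs[0]) == [cur]
                    and succs[0] not in visited):
                cur = succs[0]
                visited.add(cur)
                path += cur[-1]
            else:
                break
        unitigs.append(path)
    return unitigs


# Two toy sequences that share a prefix and then diverge at one position.
graph = kmers("ATGGCGT", 3) | kmers("ATGGCAA", 3)
unitigs = compact(graph)
```

For these two toy sequences the shared prefix collapses into one vertex and each diverging tail becomes its own vertex, so seven k-mers are represented by three unitigs.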


“…For example, sequence assembly algorithms use k-mers as nodes in the de Bruijn graph (Zerbino and Birney, 2008; Pell et al, 2012), metagenomic sample diversity can be quantified by comparing the sample's k-mer content against a database (Wood and Salzberg, 2014), k-mer content derived from RNA-seq reads can inform gene expression estimation procedures (Patro et al, 2014), and k-mer-based algorithms can dramatically improve compression of sequence (Rozov et al, 2014; Benoit et al, 2015) and quality values (Yu et al, 2014).…”

Section: Introduction (mentioning)

confidence: 99%

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

Pellow, Filippova, Kingsford (2016)

Lecture Notes in Computer Science


Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30× with little or no additional memory and with set containment queries that are only 1.3-1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
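
The overlap idea in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `BloomFilter` class, its parameters, and the function `contains_with_overlap` are hypothetical names, and the check shown is the simplest variant, in which a k-mer is accepted only if the filter also reports at least one k-mer that overlaps it by k-1 characters. It assumes every stored k-mer came from a read longer than k, so that each true member has at least one true neighbor in the set.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


def neighbors(kmer):
    """Yield the 8 k-mers that overlap kmer by k-1 characters."""
    for c in "ACGT":
        yield c + kmer[:-1]  # possible left neighbor
        yield kmer[1:] + c   # possible right neighbor


def contains_with_overlap(bf, kmer):
    """Accept kmer only if the filter also reports an overlapping neighbor."""
    return kmer in bf and any(n in bf for n in neighbors(kmer))


# Store all 4-mers of one toy read, then query them back.
read, k = "ACGTTGCATG", 4
read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
bf = BloomFilter(n_bits=1 << 16, n_hashes=3)
for km in read_kmers:
    bf.add(km)
all_found = all(contains_with_overlap(bf, km) for km in read_kmers)
```

A false positive now has to coincide with a second positive among its eight possible neighbors, which is why the extra lookups can lower the FPR at the cost of slower queries.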

