2015
DOI: 10.1186/s12859-015-0709-7

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Abstract: Background: Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for methods more efficient than general-purpose compression tools, such as the widely used gzip. Results: We present a novel reference-free method for compressing data from high-throughput sequencing technologies. Our approach, implemented in the software Leon, employs techniques derived from existing assembly principles. The metho…

citations
Cited by 91 publications
(74 citation statements)
references
References 32 publications
“…In 1993 the first specialized DNA compressor was proposed (Grumbach and Tahi, 1993). Since then, numerous DNA compressors were developed (e.g., Cao et al, 2007, Li et al, 2013, Benoit et al, 2015, Al-Okaily et al, 2017). In our experience only two compressors pass the practicality threshold: DELIMINATE (Mohammed et al, 2012) and MFCompress (Pinho and Pratas, 2014).…”
Section: Introduction (mentioning)
confidence: 88%
“…Although SMS (Single Molecule Sequencing) technologies (Rang et al, 2018; Rhoads and Au, 2015) have re-introduced the OLC framework as the method of choice to assemble long and erroneous reads (Koren et al, 2017; Li, 2016; Chin et al, 2016; Kamath et al, 2017), de Bruijn graph based methods are nonetheless used to assemble and correct long reads (Salmela and Rivals, 2014; Ruan and Li, 2019). Overall, the de Bruijn graphs have found widespread use for a variety of problems such as de novo transcriptome assembly (Robertson et al, 2010), variant calling (Uricaru et al, 2015), short read compression (Benoit et al, 2015), short read correction (Limasset et al, 2019), long read correction (Salmela and Rivals, 2014) and short read mapping (Liu et al, 2016) to name a few. The colored de Bruijn graph is a variant of the de Bruijn graph which keeps track of the source of each vertex in the graph (Iqbal et al, 2012).…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, sequence assembly algorithms use k-mers as nodes in the de Bruijn graph (Zerbino and Birney, 2008; Pell et al, 2012), metagenomic sample diversity can be quantified by comparing the sample's k-mer content against a database (Wood and Salzberg, 2014), k-mer content derived from RNA-seq reads can inform gene expression estimation procedures (Patro et al, 2014), and k-mer-based algorithms can dramatically improve compression of sequence (Rozov et al, 2014; Benoit et al, 2015) and quality values (Yu et al, 2014).…”
Section: Introduction (mentioning)
confidence: 99%