A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models

Pratas, Diogo; Hosseini, Morteza; Silva, J.; Pinho, Armando J.

doi:10.3390/e21111074

Cited by 16 publications

(8 citation statements)

References 90 publications


“… For DNA Sequence 5 (DS5), Jarvis uses the same configuration as in [ 64 ]; for DS4 and DS3 it uses Level 7. XM uses the default configuration.…”

Section: Results (mentioning)

confidence: 99%

“…From all the previous algorithms, the most efficient according to compression ratio in the wide diversity of DNA sequences are XM [ 43 ], GeCo2 [ 3 ], and Jarvis [ 64 ]. These compressors apply statistical and algorithmic model mixtures combined with arithmetic encoding.…”

Section: Introduction (mentioning)

confidence: 99%

“…The GeCo2 algorithm [ 3 ] uses soft-blending cooperation between context models and substitution-tolerant context models [ 5 ] with a specific forgetting factor for each model. The Jarvis compressor [ 64 ] uses a competitive prediction model to estimate, for each symbol, the best class of models to be used; there are 2 classes of models: weighted context models and weighted stochastic repeat models, where both classes of models use specific sub-programs to handle inverted repeats efficiently.…”

Section: Introduction (mentioning)

confidence: 99%
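The competitive idea in the statement above (pick, for each symbol, the class of models expected to predict best) can be illustrated with a minimal sketch. It is not the Jarvis implementation: the two toy classes (an order-k count model and a "last occurrence" repeat model), the decayed-cost selector, and the names `competitive_cost`, `k`, and `decay` are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def competitive_cost(seq, k=3, decay=0.9):
    """Estimated code length (bits) when each symbol is coded by the class
    of models that had the lower decayed cost on the preceding symbols.
    Hypothetical toy selector, not the one used by any real compressor."""
    counts = defaultdict(lambda: [1, 1, 1, 1])  # class 0: order-k counts with a Laplace prior
    last_next = {}      # class 1: symbol that followed the latest occurrence of each k-mer
    recent = [0.0, 0.0]  # exponentially decayed cost of each class
    total_bits = 0.0
    for i, sym in enumerate(seq):
        s = ALPHABET.index(sym)
        ctx = seq[max(0, i - k):i]

        # Class 0: probabilities from the context counts.
        c = counts[ctx]
        tot = sum(c)
        p_ctx = [x / tot for x in c]

        # Class 1: bet on the symbol seen after the previous copy of this context.
        guess = last_next.get(ctx)
        if guess is None:
            p_rep = [0.25] * 4
        else:
            p_rep = [0.9 if j == guess else 0.1 / 3 for j in range(4)]

        # The choice depends only on past symbols, so a decoder could repeat it.
        probs = (p_ctx, p_rep)
        chosen = 0 if recent[0] <= recent[1] else 1
        total_bits += -math.log2(probs[chosen][s])

        # Update the selector and both classes with the symbol just seen.
        for m in (0, 1):
            recent[m] = decay * recent[m] - math.log2(probs[m][s])
        c[s] += 1
        last_next[ctx] = s
    return total_bits
```

Dividing the returned value by `len(seq)` gives bits per base, which can be compared with the 2 bits per base of a plain four-symbol encoding.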


Efficient DNA sequence compression with neural networks

2020

Self Cite


Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. Findings: We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 in 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7–3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. Conclusions: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
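The abstract notes that the mixing method needs only the probabilities of the models as inputs. A minimal sketch of that interface follows, assuming a single softmax layer trained online with gradient descent on the log loss. GeCo3's actual network, its inputs, and its training details are not given here, so `Mixer`, `mixed_cost`, and the learning rate are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class Mixer:
    """Hypothetical single-layer mixer: model probabilities in,
    one mixed distribution over the symbols out."""

    def __init__(self, n_models, n_symbols=4, lr=0.1):
        self.n_out = n_symbols
        self.lr = lr
        n_in = n_models * n_symbols
        self.w = [[0.0] * n_in for _ in range(n_symbols)]

    def predict(self, model_probs):
        # Flatten the per-model distributions into one input vector.
        x = [p for probs in model_probs for p in probs]
        z = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        return x, softmax(z)

    def update(self, x, mixed, symbol):
        # Gradient of -log(mixed[symbol]) w.r.t. logit j is mixed[j] - [j == symbol].
        for j in range(self.n_out):
            g = mixed[j] - (1.0 if j == symbol else 0.0)
            row = self.w[j]
            for i, xi in enumerate(x):
                row[i] -= self.lr * g * xi

def mixed_cost(symbols, predictions, lr=0.1):
    """Total code length in bits when each symbol is coded with the mixed
    distribution and the mixer is updated after the symbol is seen."""
    mixer = Mixer(n_models=len(predictions[0]), lr=lr)
    bits = 0.0
    for sym, model_probs in zip(symbols, predictions):
        x, mixed = mixer.predict(model_probs)
        bits += -math.log2(mixed[sym])
        mixer.update(x, mixed, sym)
    return bits
```

Here `symbols` is a list of symbol indices (0 to 3) and `predictions[t]` holds one probability list per model for position `t`. Because the mixer sees nothing but those probabilities, the same loop could in principle sit on top of any set of models that emit distributions.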


“…We tested all DNA sequence compressors that are available and functional in 2020: dnaX [ 14 ], XM [ 15 ], DELIMINATE [ 16 ], Pufferfish [ 17 ], DNA-COMPACT [ 18 ], MFCompress [ 19 ], UHT [ 20 ], GeCo [ 21 ], GeCo2 [ 22 ], JARVIS [ 23 ], NAF [ 24 ], and NUHT [ 25 ]. We also included the relatively compact among homology search database formats: BLAST [ 26 ] and 2bit—a database format of BLAT [ 27 ].…”

Section: Results (mentioning)

confidence: 99%

Sequence Compression Benchmark (SCB) database—A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences

et al. 2020


Background: Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available. Findings: We systematically benchmarked 430 settings of 48 compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA, and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 27 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows custom visualizations to be built for selected subsets of benchmark results. Conclusion: We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows compressors and their settings to be compared using a variety of performance measures, offering the opportunity to select the optimal compressor on the basis of the data type and usage scenario specific to a particular application.


“…Therefore, methods with a compression ratio exceeding 75% are of interest. The first tools for compressing DNA sequences were developed in 1993-1994 [17,8] and continue to appear today (see, for example, [19]). Along with algorithms for compressing individual DNA sequences, "vertical" algorithms are also being developed, which code against reference sequences and record only the differences between the target and reference texts.…”

Section: Introduction (unclassified)

The complexity of DNA sequences. Different approaches and definitions

Gusev; Miroshnichenko

2020

Math.Biol.Bioinf.


An important quantitative characteristic of symbolic sequences (texts, strings) is complexity, which reflects at an intuitive level the degree of their "non-randomness". A.N. Kolmogorov formulated the most general definition of complexity. He proposed measuring the complexity of an object (symbolic sequence) by the length of the shortest description from which this object can be uniquely reconstructed. Since no program is guaranteed to find the shortest description, in practice various algorithmic approximations, considered in this paper, are used for this purpose. Along with definitions of complexity that assume a sequence can be reconstructed from its "description", a number of measures are considered that do not imply such restoration. They are based on the calculation of some quantitative characteristics. Of interest is not only a quantitative assessment of complexity, but also the identification and classification of the structural regularities that determine its specific value. In one form or another, they are expressed as repetition in the broadest sense. The considered measures of complexity are conventionally divided into statistical ones, which take into account the frequency of occurrence of symbols or short "words" in the text; "dictionary" ones, which estimate the number of different "subwords"; and "structural" ones, based on the identification of long repeating fragments of text and the determination of relationships between them. Most of the methods are designed for sequences of an arbitrary linguistic nature. The special attention paid to DNA sequences, reflected in the title of the article, is due to the importance of the object, manifestations of repetition of different types, and numerous examples of using the concept of complexity in solving problems of classification and evolution of various biological objects. Local structural features found in the sliding window mode in DNA sequences are of considerable interest, since zones of low complexity in the genomes of various organisms are often associated with the regulation of basic genetic processes.
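One common algorithmic approximation of the kind the abstract mentions is to use the output size of a general-purpose compressor as a stand-in for the length of the shortest description. The sketch below applies it in sliding-window mode. It assumes zlib as the compressor and uses illustrative window sizes; the function names and the synthetic `demo_sequence` are hypothetical and are not taken from the paper.

```python
import random
import zlib

def compression_complexity(text):
    """Compressed size divided by original size. Lower values suggest
    more regular (less complex) text; a rough proxy only."""
    data = text.encode("ascii")
    if not data:
        return 0.0
    return len(zlib.compress(data, 9)) / len(data)

def sliding_complexity(seq, window=200, step=100):
    """Return (window_start, complexity) pairs along the sequence."""
    last_start = max(0, len(seq) - window)
    return [
        (start, compression_complexity(seq[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]

def demo_sequence(seed=0):
    """Synthetic example: pseudo-random flanks around an AT-repeat zone
    occupying positions 400 to 799."""
    rng = random.Random(seed)

    def flank(n):
        return "".join(rng.choice("ACGT") for _ in range(n))

    return flank(400) + "AT" * 200 + flank(400)
```

Taking the minimum of `sliding_complexity(demo_sequence())` by its second component points at a window inside the repeat zone. On very short windows the fixed overhead of the zlib format dominates, so the ratio is only meaningful for comparing windows of equal length.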
