2009
DOI: 10.1093/bioinformatics/btp117

Textual data compression in computational biology: a synopsis

Abstract: It goes without saying that most of the research results reviewed here offer software prototypes to the bioinformatics community. The Supplementary Material provides pointers to software and benchmark datasets for a range of applications of broad interest. In addition to providing references to software, the Supplementary Material also gives a brief presentation of some fundamental results and techniques related to this paper. It is available at: http://www.math.unipa.it/~raffaele/suppMaterial/compReview/

Cited by 87 publications (75 citation statements)
References 113 publications
“…Our scheme can also be thought of as a self-index for a given multiple alignment of a sequence collection, where one can retrieve any part of any sequence as well as make queries on the content of all the aligned sequences. This is an extension of the classical objective of DNA compression in the vertical mode Tahi, 1993, 1994; Giancarlo et al, 2009), namely, where the goal is to compress a collection of genomes such that each sequence is compressed by making use of information contained in the entire set.…”
Section: Content
mentioning
confidence: 99%
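
The "vertical mode" idea in the statement above (compressing each sequence with help from the rest of the collection) can be illustrated with a small sketch. This is not the self-index scheme of the citing paper nor any method from the reviewed article; it only assumes a toy collection of near-identical DNA strings and uses Python's standard `zlib` preset-dictionary feature as a stand-in for a reference-based compressor. The helper names `make_toy_collection`, `compress_with_reference`, and `decompress_with_reference` are hypothetical.

```python
import random
import zlib


def make_toy_collection(length=2000, n_variants=3, n_mutations=5, seed=0):
    """Build a random DNA reference plus a few lightly mutated copies (toy data)."""
    rng = random.Random(seed)
    reference = bytes(rng.choice(b"ACGT") for _ in range(length))
    variants = []
    for _ in range(n_variants):
        seq = bytearray(reference)
        for _ in range(n_mutations):
            # Overwrite a random position with a random base (point substitution).
            seq[rng.randrange(length)] = rng.choice(b"ACGT")
        variants.append(bytes(seq))
    return reference, variants


def compress_with_reference(seq: bytes, reference: bytes) -> bytes:
    """Deflate `seq`, letting matches point back into `reference` (preset dictionary)."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(seq) + comp.flush()


def decompress_with_reference(blob: bytes, reference: bytes) -> bytes:
    """Invert compress_with_reference; the same reference must be supplied."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    reference, variants = make_toy_collection()
    for v in variants:
        alone = len(zlib.compress(v, 9))
        shared = len(compress_with_reference(v, reference))
        print(f"standalone: {alone} bytes, with reference: {shared} bytes")
```

Because each variant differs from the reference in only a few positions, almost all of it can be encoded as long copies from the shared dictionary, which is the effect the vertical mode exploits. Deflate's 32 KB window limits this trick to short references, so real genome collections need dedicated tools.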
“…Since a provocative 1999 study that advocated the incompressibility of proteomes (Nevill-Manning and Witten, 1999), there has been a modest flourishing of compression techniques tuned for long concatenations of polypeptides, spanning both the substitutional and the statistical realms (Giancarlo et al, 2009). We mention, among others, techniques consisting of instantiating the PPM algorithm with contexts of multiple lengths, weighted by amino acid mutation probabilities (Nevill-Manning and Witten, 1999); searching for exact and approximate reverse complements, repeats, and weighted context trees (Matsumoto et al, 2000); partitioning amino acids according to their frequency and invoking popular text compressors (Sampath, 2003); using amino acid substitution matrices to guide the creation of Huffman codes (Hategan and Tabus, 2004); building an off-line dictionary of variable-gap subsequences, constrained to be maximal in density and extension and to occur sufficiently frequently in the dataset (Apostolico et al, 2006); using panels of weighted experts that estimate the probability of a symbol using Markov models encoding species information, local context information; and repeated and complementary reversed substrings (Cao et al, 2007).…”
mentioning
confidence: 99%
“…We can easily define parallel arrays to also point to the position of the longest factor to permit easy access to these factors. Direct applications of our introduced data structures may include pattern substitution, detecting duplication [6], LZ decomposition in text compression [41], studying periodicity in strings [32,39], biological sequence compression [3,21], and analysis of repetition structures in DNA sequences [22,2]. Specifically, our pLF data structure may be used to identify how to best substitute a pattern or even determine if duplication is "hidden" by reversal or with parameterization.…”
Section: Discussion
mentioning
confidence: 99%
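
The LZ decomposition mentioned in the statement above, with parallel values recording each factor's source position and length, can be sketched as follows. This is a naive quadratic-time illustration, not the pLF data structure of the citing paper; the function names `lz_factorize` and `lz_decode` and the `(position, length)` factor encoding are assumptions made for this example.

```python
def lz_factorize(text):
    """Greedy LZ factorization.

    Each factor is either (char, 0) for a character with no earlier match,
    or (position, length) for the longest prefix of the remaining text that
    also starts at an earlier position (overlapping copies are allowed).
    """
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            factors.append((text[i], 0))  # literal: first occurrence of this symbol
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def lz_decode(factors):
    """Rebuild the text by replaying literals and (position, length) copies."""
    out = []
    for a, length in factors:
        if length == 0:
            out.append(a)
        else:
            for k in range(length):
                out.append(out[a + k])  # copy may overlap the text being produced
    return "".join(out)


if __name__ == "__main__":
    print(lz_factorize("abababab"))  # [('a', 0), ('b', 0), (0, 6)]
```

Highly repetitive inputs, such as tandem repeats in DNA, yield few long factors, which is why this decomposition is useful for both compression and repeat analysis. Practical implementations replace the inner search with suffix-array or suffix-tree machinery.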