Textual data compression in computational biology: a synopsis

Giancarlo, Raffaele; Scaturro, Dalila; Utro, Filippo

doi:10.1093/bioinformatics/btp117

Cited by 87 publications

(75 citation statements)

References 113 publications


“…Our scheme can also be thought of as a self-index for a given multiple alignment of a sequence collection, where one can retrieve any part of any sequence as well as make queries on the content of all the aligned sequences. This is an extension of the classical objective of DNA compression in the vertical mode (Tahi, 1993, 1994; Giancarlo et al, 2009), namely, where the goal is to compress a collection of genomes such that each sequence is compressed by making use of information contained in the entire set.…”

Section: Content (mentioning)

confidence: 99%

Storage and Retrieval of Highly Repetitive Sequence Collections

Mäkinen, Navarro, Sirén, et al. 2010

Journal of Computational Biology

189

243


A repetitive sequence collection is a set of sequences which are small variations of each other. Prominent examples are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longer-term repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic full-text indexes that are capable of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.
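The abstract's remark about r identical copies can be illustrated with the k-th order empirical entropy, a standard measure of how predictable the next symbol is given the k previous ones. The sketch below is not the authors' index. The function name `empirical_entropy_k`, the random DNA-like test string, and the choices k = 2 and r = 4 are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def empirical_entropy_k(s, k):
    """Return the k-th order empirical entropy of s, in bits per symbol."""
    # For every length-k context, count which symbols follow it.
    followers = defaultdict(Counter)
    for i in range(k, len(s)):
        followers[s[i - k:i]][s[i]] += 1
    # Sum the zero-order entropy of each context's followers, weighted by count.
    total_bits = 0.0
    for counts in followers.values():
        m = sum(counts.values())
        total_bits -= sum(c * math.log2(c / m) for c in counts.values())
    return total_bits / len(s)


# Hypothetical data: a pseudo-random string over the DNA alphabet.
rng = random.Random(0)
single = "".join(rng.choice("ACGT") for _ in range(2000))
repeated = single * 4  # r = 4 identical copies

h_single = empirical_entropy_k(single, 2)
h_repeated = empirical_entropy_k(repeated, 2)
print(f"H_2 of one copy:    {h_single:.3f} bits/symbol")
print(f"H_2 of four copies: {h_repeated:.3f} bits/symbol")
```

The per-symbol entropy of the four-copy string stays close to that of a single copy, so a compressor bounded by this model needs roughly four times the space, even though the collection holds no more information than one copy plus a repeat count.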

“…Since a provocative 1999 study that advocated the incompressibility of proteomes (Nevill-Manning and Witten, 1999), there has been a modest flourishing of compression techniques tuned for long concatenations of polypeptides, spanning both the substitutional and the statistical realms (Giancarlo et al, 2009). We mention, among others, techniques consisting of instantiating the PPM algorithm with contexts of multiple lengths, weighted by amino acid mutation probabilities (Nevill-Manning and Witten, 1999); searching for exact and approximate reverse complements, repeats, and weighted context trees (Matsumoto et al, 2000); partitioning amino acids according to their frequency and invoking popular text compressors (Sampath, 2003); using amino acid substitution matrices to guide the creation of Huffman codes (Hategan and Tabus, 2004); building an off-line dictionary of variable-gap subsequences, constrained to be maximal in density and extension and to occur sufficiently frequently in the dataset (Apostolico et al, 2006); using panels of weighted experts that estimate the probability of a symbol using Markov models encoding species information, local context information; and repeated and complementary reversed substrings (Cao et al, 2007).…”

mentioning

confidence: 99%

The Subsequence Composition of Polypeptides

Apostolico, Cunial 2010

Journal of Computational Biology


The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Several past studies have addressed the question of what distinguishes biosequences from random strings, the latter being clearly unpalatable to the living cell. Such studies typically analyze the organization of biosequences in terms of their constituent characters or substrings and have, in particular, consistently exposed a tenacious lack of compressibility on behalf of biosequences. This article attempts, perhaps for the first time, an assessment of the structure and randomness of polypeptides in terms of newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. It is shown that such parameters grasp structural/functional information, and are related to each other under a specific set of rules that span biochemically diverse polypeptides. Measures on subsequences separate few amino acid strings from their random permutations, but show that the random permutations of most polypeptides amass along specific linear loci.


“…We can easily define parallel arrays to also point to the position of the longest factor to permit easy access to these factors. Direct applications of our introduced data structures may include pattern substitution, detecting duplication [6], LZ decomposition in text compression [41], studying periodicity in strings [32,39], biological sequence compression [3,21], and analysis of repetition structures in DNA sequences [22,2]. Specifically, our pLF data structure may be used to identify how to best substitute a pattern or even determine if duplication is "hidden" by reversal or with parameterization.…”

Section: Discussion (mentioning)

confidence: 99%

Variations of the parameterized longest previous factor

Beal, Adjeroh 2012

Journal of Discrete Algorithms


The parameterized longest previous factor (pLPF) problem as defined for parameterized strings (p-strings) adds a level of parameterization to the longest previous factor (LPF) problem originally defined for traditional strings. In this work, we consider the construction of the pLPF data structure and identify the strong relationship between the pLPF linear time construction and several variations of the problem. Initially, we propose a taxonomy of classes for longest factor problems. Using this taxonomy, we show an interesting connection between the pLPF and popular data structures. It is shown that a subset of longest factor problems may be created with the pLPF construction. More specifically, the pLPF problem is used as a foundation to achieve the linear time construction of popular data structures such as the LCP, parameterized-LCP (pLCP), parameterized-border (p-border) array, and border array. We further generalize the permuted-LCP for p-strings and provide a linear time construction. A number of new variations of the pLPF problem are proposed and addressed in linear time for both p-strings and traditional strings, including the longest not-equal factor (LneF), longest reverse factor (LrF), and longest factor (LF). The framework of the pLPF construction is exploited to efficiently address a multitude of data structures with prospects in various applications. Finally, we implement our algorithms and perform various experiments to confirm theoretical results.
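For readers unfamiliar with the underlying (non-parameterized) LPF array, a minimal sketch follows. It uses the usual definition, where entry i is the length of the longest factor starting at position i that also starts at some earlier position, with overlaps allowed. This is a naive quadratic-or-worse illustration, not the linear-time construction described in the abstract, and it does not handle p-strings. The function name `lpf_naive` is an assumption.

```python
def lpf_naive(s):
    """Return the longest previous factor array of s by brute force."""
    n = len(s)
    lpf = [0] * n
    for i in range(n):
        # Try every earlier start j and extend the match as far as possible.
        for j in range(i):
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            lpf[i] = max(lpf[i], length)
    return lpf


print(lpf_naive("abab"))
print(lpf_naive("aaaa"))
```

Since the earlier occurrence may overlap the current one, a run such as "aaaa" yields long entries. This property is what makes LPF-style arrays useful for LZ-type factorizations.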
