Raffaele Giancarlo scite author profile

We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the “best possible” contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties.Technically, our boosting technique builds upon three main ingredients: the Burrows--Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically, we show that there exists a proper partition of the Burrows--Wheeler Transform of a string s that shows a deep combinatorial relation with the k th order entropy of s . That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost.Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s , and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster, we get better compression algorithms than some of the best existing ones, that is, LZ77, LZ78, PPMC and the ones derived from the Burrows--Wheeler Transform. Further, we settle analytically some long-standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process.

show abstract

Sparse dynamic programming I

Eppstein¹,

Galil²,

Giancarlo³

et al. 1992

J. ACM

127

118

View full text Add to dashboard Cite

Dynamic programming solutlons to a number of different recurrence equations for sequence comparison and for RNA secondary structure prediction are considered. These recurrences are defined over a number of points that is quadratic in the input size; however only a sparse set matters for the result. Efficient algorithms for these problems are given, when the weight functions used in the recurrences are taken to be linear. The time complexity of the algorithms depends almost linearly on the number of points that need to be considered; when the problems are sparse this results in a substantial speed-up over known algorithms.

show abstract

Speeding up the Consensus Clustering methodology for microarray data analysis

Giancarlo

Utro²

2011

Algorithms Mol Biol

101

View full text Add to dashboard Cite

BackgroundThe inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, is a natural candidate for a speed-up.ResultsSince the time-precision performance of depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for . That is, the closely related algorithm (Fast Consensus) that would have the same precision as with a substantially better time performance. The performance of has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand. Based on their outcome, compared with previous benchmarking results available in the literature, turns out to be among the fastest internal validation methods, while retaining the same outstanding precision of . Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by . We have also experimented with the use of and in conjunction with (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although is an increasingly popular technique for biological data mining, our results are somewhat disappointing and complement quite well the state of the art about , shedding further light on its merits and limitations.ConclusionsIn summary, with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e, number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures.

show abstract

Textual data compression in computational biology: a synopsis

Giancarlo

Scaturro

Utro

2009

View full text Add to dashboard Cite

It goes without saying that most of the research results reviewed here offer software prototypes to the bioinformatics community. The Supplementary Material provides pointers to software and benchmark datasets for a range of applications of broad interest. In addition to provide reference to software, the Supplementary Material also gives a brief presentation of some fundamental results and techniques related to this paper. It is at: http://www.math.unipa.it/ approximately raffaele/suppMaterial/compReview/

show abstract

The Boyer–Moore–Galil String Searching Strategies Revisited

Apostolico¹,

Giancarlo²

1986

SIAM J. Comput.

124

View full text Add to dashboard Cite

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.