Whole genome alignments, a particularly voluminous type of dataset in molecular genomics, have gained considerable importance in recent years. In this paper, we propose a compression modeling approach for the multiple sequence alignment (MSA) blocks, which make up most of these datasets. Our method is based on a mixture of finite-context models. In contrast to other recent approaches, it addresses both the DNA bases and gap symbols at once, better exploiting the existing correlations. For comparison with previous methods, our algorithm was tested on the multiz28way dataset. On average, it attained 0.94 bits per symbol, approximately 7% better than the previous best, at a similar computational complexity. We also tested the model on the most recent dataset, multiz46way. On this dataset, which contains alignments of 46 different species, our compression model achieved an average of 0.72 bits per MSA block symbol.
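To make the idea of a mixture of finite-context models concrete, the following is a minimal sketch, not the method of the paper. It assumes a five-symbol alphabet (the four bases plus the gap symbol), adaptive order-k count tables with additive smoothing, and a simple performance-based reweighting of the models. The names `FiniteContextModel` and `mixture_bits_per_symbol`, the chosen orders, and the parameters `alpha` and `gamma` are all illustrative assumptions. The function returns the average number of bits per symbol that an ideal arithmetic coder driven by the mixture would spend.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT-"  # assumption: DNA bases and the gap symbol modelled together


class FiniteContextModel:
    """Adaptive order-k model: counts of the symbols seen after each context."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # additive smoothing constant (illustrative choice)
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is special-cased
        return history[-self.order:] if self.order else ""

    def probs(self, history):
        c = self.counts[self._context(history)]
        total = sum(c) + self.alpha * len(ALPHABET)
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def mixture_bits_per_symbol(sequence, orders=(1, 3, 6), gamma=0.9):
    """Average code length, in bits per symbol, under a weighted model mixture."""
    models = [FiniteContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        s = ALPHABET.index(symbol)
        per_model = [m.probs(history) for m in models]
        # mixture probability of the symbol that actually occurred
        p = sum(w * pm[s] for w, pm in zip(weights, per_model))
        total_bits -= math.log2(p)
        # reward models that predicted this symbol well; gamma < 1 forgets old evidence
        weights = [(w ** gamma) * pm[s] for w, pm in zip(weights, per_model)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return total_bits / len(sequence)


if __name__ == "__main__":
    print(mixture_bits_per_symbol("ACGT-" * 200))
```

On a highly repetitive input such as the one above, the result falls well below the log2(5) bits per symbol of a uniform code, because the models quickly learn which symbol follows each context.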
This paper deals with the design and implementation of a data model and operations for handling continuously changing spatial data in the Oracle 11g object-relational DBMS. The data model relies on abstract data types, but we introduce modifications to the internal structure of the spatiotemporal data representations proposed in the literature, to reduce storage requirements and to enable the reuse of data during the execution of queries. We show how to implement spatiotemporal operations relying on the spatial functions provided by the underlying DBMS and how to use the alternative data representations to reduce the volume of temporary data created in the evaluation of spatiotemporal operations. We also demonstrate how to use the proposed data types and operations for the storage and manipulation of moving-object data using SQL. Finally, we discuss the advantages and disadvantages of the proposed solutions.
The development and implementation of computational models to represent DNA sequences is a great challenge. Markov models, usually known as finite-context models, have long been used in DNA compression. In previous work, we showed that finite-context modelling can also be used for sequence generation. Furthermore, it is known that DNA is better represented by multiple finite-context models. However, the previous generator only allowed a single finite-context model to be used for generating a given sequence. In this paper, we present results regarding a synthetic DNA generator based on multiple competing finite-context models.
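As a rough illustration of sequence generation with several finite-context models, the sketch below trains one count table per order on a seed sequence and then emits one base at a time. The abstract does not say how the models compete, so the selection rule used here is purely an assumption: the highest-order model whose current context was seen at least `min_support` times in training wins, and a uniform draw is used when no model qualifies. The function names, orders, and parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict

BASES = "ACGT"


def train(sequence, order):
    """Count which base follows each length-`order` context in `sequence`."""
    table = defaultdict(Counter)
    for i in range(order, len(sequence)):
        table[sequence[i - order:i]][sequence[i]] += 1
    return table


def generate(seed_sequence, length, orders=(2, 4, 8), min_support=2, rng=None):
    """Generate `length` bases from models trained on `seed_sequence`."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    models = {k: train(seed_sequence, k) for k in orders}
    # start from the opening bases of the seed so every model has a context
    out = list(seed_sequence[:max(orders)])
    while len(out) < length:
        winner = None
        # assumed competition rule: highest order with enough training support
        for k in sorted(orders, reverse=True):
            counts = models[k].get("".join(out[-k:]))
            if counts and sum(counts.values()) >= min_support:
                winner = counts
                break
        if winner is None:
            out.append(rng.choice(BASES))  # no model qualifies: uniform draw
        else:
            symbols = list(winner.keys())
            freqs = list(winner.values())
            out.append(rng.choices(symbols, weights=freqs)[0])
    return "".join(out[:length])


if __name__ == "__main__":
    print(generate("ACGT" * 50, 40))
```

With a strictly periodic seed, every context has a single possible successor, so the output simply continues the period. Real seeds give each context several successors, and the output then varies with the random draws.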
In the last decade, the cost of genomic sequencing has decreased so much that researchers all over the world are accumulating huge amounts of data for present and future use. These genomic data need to be stored efficiently, because the cost of storage is not decreasing as fast as the cost of sequencing. To address this problem, gzip, the most popular general-purpose compression tool, is usually used. However, general-purpose tools were not specifically designed to compress this kind of data, and they often fall short when the intention is to reduce the data size as much as possible. Several compression algorithms are available, even for genomic data, but very few have been designed to deal with Whole Genome Alignments, which contain alignments between entire genomes of several species. In this paper, we present a lossless compression tool, MAFCO, specifically designed to compress MAF (Multiple Alignment Format) files. Compared to gzip, the proposed tool attains a compression gain from 34% to 57%, depending on the data set. When compared to a recent dedicated method, which is not compatible with some data sets, the compression gain of MAFCO is about 9%. Both source code and binaries for several operating systems are freely available for non-commercial use at: http://bioinformatics.ua.pt/software/mafco.