We study statistical properties of the Jensen-Shannon divergence D, which quantifies the difference between probability distributions, and which has been widely applied to analyses of symbolic sequences. We present three interpretations of D in the framework of statistical physics, information theory, and mathematical statistics, and obtain approximations of the mean, the variance, and the probability distribution of D in random, uncorrelated sequences. We present a segmentation method based on D that is able to segment a nonstationary symbolic sequence into stationary subsequences, and apply this method to DNA sequences, which are known to be nonstationary on a wide range of different length scales.
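As an illustration of the quantity D described above, the following minimal sketch computes a Jensen-Shannon divergence between the two halves of a cut symbolic sequence. It assumes each subsequence is weighted by its relative length, which is one common convention and is not stated in the abstract. The function names `shannon_entropy`, `js_divergence`, and `best_split` are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def js_divergence(left, right):
    """Jensen-Shannon divergence between two subsequences.

    Assumption: each subsequence is weighted by its relative length.
    """
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return (shannon_entropy(left + right)
            - (n1 / n) * shannon_entropy(left)
            - (n2 / n) * shannon_entropy(right))


def best_split(seq):
    """Return (position, D) for the cut position that maximizes D."""
    candidates = ((i, js_divergence(seq[:i], seq[i:]))
                  for i in range(1, len(seq)))
    return max(candidates, key=lambda pair: pair[1])
```

For a sequence such as `"AAAATTTT"`, `best_split` places the cut at the compositional boundary. Deciding whether such a cut is statistically significant requires the distribution of D in random sequences, which this sketch does not implement.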
A segmentation algorithm based on the Jensen-Shannon entropic divergence is used to decompose long-range correlated DNA sequences into statistically significant, compositionally homogeneous patches. By appropriately setting the significance level for segmenting the sequence, the underlying power-law distribution of patch lengths can be revealed. Some of the identified DNA domains were uncorrelated, but most of them continued to display long-range correlations even after several steps of recursive segmentation, indicating that the sequence has a complex structure spanning multiple length scales. In addition, shuffled sequences that preserve the original statistical distribution of patch lengths were generated in two ways: by separately shuffling each segment, and by randomly rearranging the order in which the segments occur in the sequence. Both types of random sequences displayed the same correlation scaling exponents as the original DNA sequence, demonstrating that neither the internal structure of patches nor the order in which they are arranged in the sequence is critical; long-range correlations in nucleotide sequences therefore seem to rely only on the power-law distribution of patch lengths.
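The two kinds of shuffled sequences described above can be sketched as follows. This is only an illustration under simple assumptions: segments are given as a list of strings, and the helper names `shuffle_within_segments` and `shuffle_segment_order` are hypothetical. Both operations leave the set of patch lengths unchanged.

```python
import random


def shuffle_within_segments(segments, rng):
    """Shuffle the symbols inside each segment; segment order is kept."""
    shuffled = []
    for seg in segments:
        symbols = list(seg)
        rng.shuffle(symbols)  # in-place shuffle of this segment only
        shuffled.append("".join(symbols))
    return shuffled


def shuffle_segment_order(segments, rng):
    """Rearrange the order of the segments; their contents are kept."""
    reordered = list(segments)
    rng.shuffle(reordered)
    return reordered


# Example with made-up segments and a fixed seed for reproducibility.
rng = random.Random(0)
segments = ["AAAT", "GCGCGC", "TA"]
within = shuffle_within_segments(segments, rng)
reordered = shuffle_segment_order(segments, rng)
```

Joining either result with `"".join(...)` gives a surrogate sequence whose correlation properties could then be compared with those of the original.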
Small RNA research is a rapidly growing field. Apart from microRNAs, which are important regulators of gene expression, other types of functional small RNA molecules have been reported in animals and plants. MicroRNAs are important in host-microbe interactions, and parasite microRNAs might modulate the innate immunity of the host. Furthermore, small RNAs can be detected in bodily fluids, making them attractive non-invasive biomarker candidates. Given the broad interest in small RNAs, and in microRNAs in particular, the scientific community needs a large number of bioinformatics-aided analysis types. To facilitate integrated sRNA research, we developed sRNAtoolbox, a set of independent but interconnected tools for expression profiling from high-throughput sequencing data, consensus differential expression, target gene prediction, visual exploration in a genome context as a function of read length, gene list analysis, and BLAST search of unmapped reads. All tools can be used independently or for the exploration and downstream analysis of sRNAbench results. Workflows such as the prediction of consensus target genes of parasite microRNAs in the host, followed by the detection of enriched pathways, can be easily established. The web interface interconnecting all these tools is available at http://bioinfo5.ugr.es/srnatoolbox