Graphical representation of DNA sequences is one of the most popular techniques for alignment-free sequence comparison. Here, we propose a new feature-extraction method for DNA sequences represented as binary images, in which the similarity between DNA sequences is estimated from the frequency histograms of local bitmap patterns in the images. The time complexity of our method is linear in the length of the DNA sequences, which makes it practical even for comparing long sequences such as whole genomes. We tested five distance measures for estimating sequence similarity and found that histogram intersection and Manhattan distance are the most appropriate for phylogenetic analyses.
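The comparison step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the binary image is built from the sequence, which window size is used, or how the histograms are normalized, so everything below is an assumption for illustration only: the function names, the 2-by-2 window (`k=2`), the row-by-row bit encoding, and the normalization to unit sum are all hypothetical.

```python
from collections import Counter


def pattern_histogram(image, k=2):
    """Normalized frequency histogram of k-by-k local bit patterns.

    `image` is a list of rows of 0/1 integers. The window size k and the
    bit encoding are assumptions made for this sketch.
    """
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            # Encode the k*k window as an integer, reading bits row by row.
            code = 0
            for dr in range(k):
                for dc in range(k):
                    code = (code << 1) | image[r + dr][c + dc]
            counts[code] += 1
    total = sum(counts.values())
    # One bin per possible pattern, so histograms are directly comparable.
    return [counts.get(code, 0) / total for code in range(2 ** (k * k))]


def manhattan_distance(p, q):
    """Sum of absolute bin-wise differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def histogram_intersection_distance(p, q):
    """1 minus the histogram intersection (a similarity for unit-sum histograms)."""
    return 1.0 - sum(min(a, b) for a, b in zip(p, q))


# Usage with two toy 3x3 binary images (hypothetical data).
image_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
image_b = [[1, 1, 0], [1, 0, 0], [0, 0, 0]]
hist_a = pattern_histogram(image_a)
hist_b = pattern_histogram(image_b)
d_manhattan = manhattan_distance(hist_a, hist_b)
d_intersection = histogram_intersection_distance(hist_a, hist_b)
```

In this sketch the scan visits each window once with a fixed window size, so the cost grows linearly with the number of pixels. Note also that when both histograms sum to 1, as assumed here, the Manhattan distance equals twice the intersection-based distance, so the two measures rank sequence pairs identically under that normalization.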
Sequence comparison is one of the most fundamental tasks in bioinformatics. For biological sequence comparison, alignment is the most effective method when the sequences are not too long. However, because the time complexity of alignment is quadratic in the sequence length, aligning long sequences requires a large amount of computational time. Alignment-free sequence comparison methods are therefore needed to compare sequences such as whole genomes in practical time. In this chapter, we review the graphical representation of biological sequences, which is one of the major alignment-free sequence comparison methods. We also discuss the notable effects of weighting in the graphical representation, first introduced by the author and co-workers.
We explored the possibility of whole-genome duplication (WGD) in prokaryotic species by statistically analyzing the configurations of the central angles between homologous tandem repeats (TRs) on circular chromosomes. First, we detected TRs on each chromosome and identified equivalent tandem repeat pairs (ETRPs); here, an ETRP is defined as a pair of tandem repeats with similar sequences. We then compared the central angle distribution of the detected ETRPs on each circular chromosome with distributions generated by null models. In these analyses, we estimated a P value by simulation, using the Kullback–Leibler divergence as a distance measure between two distributions. As a result, the central angle distributions for 8 of the 203 prokaryotic species showed statistically significant deviations (P < 0.05). In particular, we found the characteristic signature of one round of WGD in the Photorhabdus luminescens genome and that of two rounds of WGD in Escherichia coli K12.
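The simulation-based P value described in this abstract can be sketched as a Monte Carlo test. The abstract does not specify the null models, the binning of central angles, or the exact P value estimator, so the following is only one plausible reading under stated assumptions: a single null model in which both repeats of each pair are placed independently and uniformly on the chromosome, 18 bins of 10 degrees, a pseudocount of 0.5 per bin, and the common (hits + 1) / (simulations + 1) estimator. All function names and parameters are hypothetical.

```python
import math
import random


def central_angle(pos_a, pos_b, genome_length):
    """Central angle in degrees (0 to 180) between two positions on a circle."""
    d = abs(pos_a - pos_b) % genome_length
    d = min(d, genome_length - d)  # take the shorter arc
    return 360.0 * d / genome_length


def angle_histogram(angles, n_bins=18, pseudocount=0.5):
    """Smoothed, normalized histogram of angles over [0, 180] degrees."""
    counts = [pseudocount] * n_bins
    width = 180.0 / n_bins
    for a in angles:
        counts[min(int(a / width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def simulated_p_value(observed_angles, genome_length, n_sim=1000, n_bins=18, seed=0):
    """Fraction of null simulations at least as divergent as the observation."""
    rng = random.Random(seed)
    n_pairs = len(observed_angles)
    # Assumed null: independent uniform positions give uniform central angles.
    reference = [1.0 / n_bins] * n_bins
    observed_kl = kl_divergence(angle_histogram(observed_angles, n_bins), reference)
    hits = 0
    for _ in range(n_sim):
        simulated = [
            central_angle(rng.randrange(genome_length),
                          rng.randrange(genome_length),
                          genome_length)
            for _ in range(n_pairs)
        ]
        if kl_divergence(angle_histogram(simulated, n_bins), reference) >= observed_kl:
            hits += 1
    return (hits + 1) / (n_sim + 1)


# Usage with hypothetical data: 50 pairs sitting exactly opposite each other.
p_value = simulated_p_value([180.0] * 50, genome_length=1_000_000)
```

The toy input concentrates every pair at 180 degrees, the kind of strongly non-uniform pattern that this test would flag; real ETRP data and the null models actually used in the study may differ from what is assumed here.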
Using the color-coding (CC) method, which is based on visual inspection, three independent inspectors searched for tandem repeats (TRs) in the Yersinia pestis, Deinococcus radiodurans and Haemophilus influenzae genomes, and the detected TRs were compared to investigate individual variation among inspectors. We also compared the TR detection ability of the CC method with that of Tandem Repeats Finder (TRF), one of the algorithmic methods for searching TRs, demonstrating that the CC method detects a much larger number of TRs than TRF, including long TRs with much lower sequence identity.