We explored DNA structures of genomes by means of a new tool derived from the "chaotic dynamical systems" theory (the so-called chaos game representation [CGR]), which allows the depiction of frequencies of oligonucleotides in the form of images. Using CGR, we observe that subsequences of a genome exhibit the main characteristics of the whole genome, attesting to the validity of the genomic signature concept. Base concentrations, stretches (runs of complementary bases or purines/pyrimidines), and patches (over- or underexpressed words of various lengths) are the main factors explaining the variability observed among sequences. The distance between images may be considered a measure of phylogenetic proximity. Eukaryotes and prokaryotes can be identified merely on the basis of their DNA structures.
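As a rough illustration of how a CGR image encodes oligonucleotide frequencies, the following minimal sketch bins the chaos game onto a 2^k x 2^k grid, so that each pixel counts one word of length k. The corner layout (A, C, G, T), the function names `cgr_image` and `image_distance`, and the use of a Euclidean distance between images are assumptions made for this example, not details taken from the abstract.

```python
import numpy as np

# Assumed corner layout (conventions vary between authors):
# A=(0,0), C=(0,1), G=(1,1), T=(1,0), given as (x, y).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_image(seq, k=3):
    """Return a normalized 2^k x 2^k frequency image of the k-mers in seq.

    The classic rule "move halfway toward the corner of the current base"
    is tracked here at k-bit resolution with integers: the newest base
    becomes the most significant bit of each coordinate.
    """
    size = 1 << k
    img = np.zeros((size, size), dtype=float)
    ix = iy = 0
    run = 0  # consecutive valid bases seen so far
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Unknown symbol (e.g. N): restart so no k-mer spans it.
            ix = iy = run = 0
            continue
        ix = (ix >> 1) | (corner[0] << (k - 1))
        iy = (iy >> 1) | (corner[1] << (k - 1))
        run += 1
        if run >= k:
            img[iy, ix] += 1
    total = img.sum()
    return img / total if total > 0 else img


def image_distance(a, b):
    """Euclidean distance between two images (one possible choice)."""
    return float(np.linalg.norm(a - b))
```

For example, `image_distance(cgr_image(seq_a, k=4), cgr_image(seq_b, k=4))` would compare two sequences at the tetranucleotide level, where `seq_a` and `seq_b` are placeholder names for DNA strings.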
Curvilinear component analysis (CCA) is performed by an original self-organized neural network, which provides a convenient approach for dimension reduction and data exploration. It consists of a non-linear, distance-preserving projection of a set of quantizing vectors describing the input space. The CCA technique is applied to the analysis of CGR (Chaos Game Representation) fractal images of DNA sequences from different species. The CGR method produces images in which pixels represent the frequencies of short sequences of bases, revealing nested patterns in DNA sequences. Comparisons of the results obtained using CCA, PCA (principal component analysis), and Kohonen's SOMs are carried out using several hundred CGR images. CCA provides a good topology-preserving mapping of images; in contrast with PCA, the residual error on distances between images after projection is found to be much smaller, whatever the dimensionality of the output space. Kohonen's SOMs offer attractive results, which unfortunately can sometimes be impeded by their overly strong dependence on predefined constraints on the neighborhood between output neurons. All three methods achieve interesting groupings of images, often in relation to phylogenetic characteristics of species. The results obtained with CCA and SOM form a good basis for further phylogenetic classification.
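The "residual error on distances between images after projection" can be made concrete with a small sketch. The code below does not implement CCA; it only shows the PCA baseline, computed via a singular value decomposition, together with one possible residual measure (the mean absolute difference between pairwise distances before and after projection). The function names and the choice of residual are assumptions for illustration, and each row of `X` is taken to be one flattened image.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distances between all pairs of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pca_project(X, n_components):
    """Project the rows of X onto their first principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def distance_residual(X, Y):
    """Mean absolute change in pairwise distance from X to projection Y.

    This is an assumed definition, chosen for simplicity.
    """
    dx = pairwise_distances(X)
    dy = pairwise_distances(Y)
    iu = np.triu_indices(len(X), k=1)  # each pair counted once
    return float(np.mean(np.abs(dx[iu] - dy[iu])))
```

Calling `distance_residual(X, pca_project(X, d))` for increasing `d` gives a curve of residual against output dimensionality, against which another projection method could be compared using the same measure.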