The paper presents a novel approach to studying nucleotide sequence structure, applied to the analysis of a chloroplast genome DNA sequence. A specific distribution pattern of the frequencies of consecutive nucleotide triplets was identified in the chloroplast genome DNA sequence; the pattern is non-degenerate and comprises seven clusters. Keywords: chloroplast genome, complexity, frequency dictionary, order, phase, triplet. DOI: 10.17516/1997-1389 ... (Krutovsky et al., 2014; Bondar et al., 2015; Sadovsky et al., 2015). This sequence consisted of 122 561 symbols or letters from the four-letter alphabet. Neither other symbols nor blank spaces are supposed to be found in a sequence; a sequence under consideration is also supposed to be coherent (i.e., consisting of a single piece). The identification of, and search for, structures in DNA sequences is a main objective of mathematical bioinformatics, biophysics and related scientific fields, including computer programming and information theory. Structures observed within a sequence reveal an order and make it easier to understand the functional roles of a sequence or its fragments. A new function (or a connection between function and structure, or taxonomy) might be discovered through a search for new patterns in the symbol sequences corresponding to DNA molecules. It is commonly accepted that nucleotide sequences are rather inhomogeneous in terms of structuredness, which is demonstrated in this paper. In particular, any genome sequence roughly comprises two types of subsequences: coding and non-coding ones. These subsequences usually do not overlap, while their concatenation yields the ...
Materials and Methods. Concept. First, we partitioned the symbol sequences (the chloroplast genomes) into a set of overlapping fragments 303 symbols (nucleotides) long, starting from the first symbol of the sequence and shifting the window by a step of 10 symbols along the chloroplast genome sequence. Second, for each fragment in the series described above, a special frequency dictionary was developed. Third, the ensemble of the dictionaries (a set of points in the 63-dimensional Euclidean space) was clustered using the K-means technique (Fukunaga, 1990; Mirkes et al., 2013). Fourth, the distribution of those fragments over an elastic map was studied (Zinovyev, 2009, 2010; Gorban et al., 2008). Finally, the correlation between the fragments belonging to the different classes obtained through K-means and the elastic map implementation and the functionally charged regions of the genome was studied. Sequence data. The chloroplast genome sequences were ... Lattice and Dictionary. In earlier studies (Bugaenko et al., 1996, 1997, 1998; Hu and Wang, 2001), it was demonstrated tha...
The paper presents a novel approach to inferring structuredness in a set of symbol sequences such as transcriptome nucleotide sequences. The distribution pattern of triplet frequencies in the Siberian larch (Larix sibirica Ledeb.) transcriptome sequences was investigated in the present study. It was found that the larch transcriptome demonstrates a number of unexpected symmetries in its statistical and combinatorial properties. Keywords: nucleotide sequence complexity, frequency dictionary, order, Larix sibirica, Siberian larch, symmetry, transcriptome, triplet. DOI: 10.17516/1997-1389-2015 ... For our further analysis we also assumed that neither other symbols nor blank spaces are supposed to be found in a sequence; a sequence under consideration is also supposed to be coherent (i.e., consisting of a single piece). We studied order and structuredness over a set of sequences from a finite alphabet. The key idea in our search for structure and order in a set of symbol sequences (transcriptome nucleotide sequences) is to translate the sequences into their frequency dictionaries (Bugaenko et al., 1996, 1997, 1998; Hu and Wang, 2001). A frequency dictionary can be defined in various ways, but we will use the basic definition: a list of all the strings of a given length, each accompanied by its frequency (a detailed description is given below). It is crucial that the transformation of a symbol sequence into a frequency dictionary allows us to map a set of sequences into a metric space. The latter provides us with powerful and extensive tools for analysis. We will briefly outline the concept of our study and then demonstrate the main results obtained.
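The basic definition above (all strings of a given length, each with its frequency) can be illustrated with a minimal Python sketch. The excerpt does not reproduce the authors' exact construction, so the choices here are assumptions made for illustration only: windows overlap and move one symbol at a time, the sequence is not closed into a ring, and the names `frequency_dictionary` and `euclidean_distance` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def frequency_dictionary(sequence, q=3, alphabet="ACGT"):
    """Map a symbol sequence to the frequencies of all strings of length q.

    Assumption: overlapping windows with step 1 on a linear (non-ring) sequence.
    """
    total = len(sequence) - q + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the string length q")
    counts = Counter(sequence[i:i + q] for i in range(total))
    # List every possible string, including those that never occur (frequency 0).
    words = ("".join(letters) for letters in product(alphabet, repeat=q))
    return {word: counts[word] / total for word in words}


def euclidean_distance(dict_a, dict_b):
    """Distance between two dictionaries built over the same set of strings."""
    return sqrt(sum((dict_a[w] - dict_b[w]) ** 2 for w in dict_a))


if __name__ == "__main__":
    first = frequency_dictionary("ACGTACGT")
    second = frequency_dictionary("AAAAAAAA")
    print(len(first), first["ACG"], euclidean_distance(first, second))
```

Each dictionary is a point with 64 coordinates for triplets, so any distance defined on such points (Euclidean here) turns a set of sequences into a set of points in a metric space, which is what makes clustering applicable.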
First, we converted each symbol sequence (that is, a nucleotide sequence in the Siberian larch transcriptome set) into a frequency dictionary. Then, we studied the distribution of those dictionaries in a multidimensional space, trying to infer regularities and clusters. Second, we checked the stability of each clustering; the clustering was carried out using the K-means technique. Third, we compared the statistical properties of the clusters identified by K-means and found that these clusters demonstrated a very strong symmetry in terms of their statistical properties. In brief, the clusters showed an extremely low level of discrepancy from Chargaff's second parity rule. This low discrepancy is the most intriguing fact concerning the properties of the studied transcriptome sequence set. Materials and Methods. Transcriptome nucleotide sequence data. The transcriptome ... Surely, this part of the transcriptome requires special studies. Frequency Dictionary. Previously (Bugaenko et al., 1996, 1997, 1998; Hu and Wang, 2001), a frequency dictionary was proposed to be a fun...
A new method is proposed to identify clusters in datasets. The method is based on the sequential elimination of the longest distances in a dataset, so that the corresponding graph loses some edges. The method stops when the graph becomes disconnected.
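The procedure described above can be sketched in Python as follows. This is only one possible reading of the abstract, not the authors' implementation: it assumes that the graph is the complete graph on the data points, that distances are Euclidean, and that the connected components at the moment of disconnection are taken as the clusters. The function names `connected_components` and `split_by_longest_edges` are hypothetical.

```python
from itertools import combinations
from math import dist


def connected_components(n, edges):
    """Return the connected components of a graph on vertices 0..n-1."""
    neighbours = {v: set() for v in range(n)}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal from an unvisited vertex
            v = stack.pop()
            component.append(v)
            for w in neighbours[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components


def split_by_longest_edges(points):
    """Remove the longest edges one by one until the graph disconnects."""
    n = len(points)
    # Assumption: complete graph, edges sorted by ascending Euclidean length.
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    while edges:
        edges.pop()  # eliminate the currently longest distance
        components = connected_components(n, edges)
        if len(components) > 1:
            return components  # the graph has just become disconnected
    return [[v] for v in range(n)]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    print(split_by_longest_edges(data))
```

Rechecking connectivity after every removal keeps the sketch close to the wording of the abstract but is slow for large datasets; it is meant for small illustrative inputs only.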