In this study, we developed a new mathematical method for performing multiple alignment of highly divergent sequences (MAHDS), i.e., sequences that have on average more than 2.5 substitutions per position (x). We generated sets of artificial DNA sequences with x ranging from 0 to 4.4 and applied MAHDS, as well as currently used multiple sequence alignment algorithms, including ClustalW, MAFFT, T-Coffee, Kalign, and MUSCLE, to these sets. The results indicated that most of the existing methods could produce statistically significant alignments only for the sets with x < 2.5, whereas MAHDS could operate on sequences with x = 4.4. We also used MAHDS to analyze a set of promoter sequences from the Arabidopsis thaliana genome and discovered many conserved regions upstream of the transcription initiation site (from −499 to +1 bp); a part of the downstream region (from +1 to +70 bp) also significantly contributed to the obtained alignments. The possibilities of applying the newly developed method to the identification of promoter sequences in any genome are discussed. A server for multiple alignment of nucleotide sequences has been created.
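To make the divergence parameter x concrete, the following is a minimal sketch of one way to generate an artificial test set. It is not the authors' generator: it assumes x is counted as substitution events per position relative to a common random ancestor, and the names `mutate`, `make_set` and `identity` are hypothetical.

```python
import random

ALPHABET = "ACGT"


def mutate(seq, x, rng):
    """Apply round(x * len(seq)) substitution events at random positions.

    Assumption: x is the average number of substitution events per position.
    A position can be hit several times, so x may exceed 1.
    """
    bases = list(seq)
    n_events = round(x * len(bases))
    for _ in range(n_events):
        i = rng.randrange(len(bases))
        # Always pick a different base, so every event is a real substitution.
        bases[i] = rng.choice([b for b in ALPHABET if b != bases[i]])
    return "".join(bases)


def make_set(length, n_seqs, x, seed=0):
    """Return a random ancestor and n_seqs independently mutated copies."""
    rng = random.Random(seed)
    ancestor = "".join(rng.choice(ALPHABET) for _ in range(length))
    return ancestor, [mutate(ancestor, x, rng) for _ in range(n_seqs)]


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(p == q for p, q in zip(a, b)) / len(a)


if __name__ == "__main__":
    for x in (0.0, 1.0, 2.5, 4.4):
        ancestor, seqs = make_set(length=600, n_seqs=5, x=x)
        mean_id = sum(identity(ancestor, s) for s in seqs) / len(seqs)
        print(f"x = {x:3.1f}  mean identity to ancestor = {mean_id:.3f}")
```

With only four bases, repeated hits on the same position push the identity towards the 25% expected by chance as x grows, which is why alignment at large x is hard.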
Computational identification of promoter sequences in eukaryotic genomes is an important task in bioinformatics. However, this problem remains unsolved because the best algorithms have a false positive probability of 10⁻³–10⁻⁴ per nucleotide. As a result, a full-genome analysis may yield more false positives than annotated gene promoters. The false positive probability should be reduced to 10⁻⁶–10⁻⁸ to lower the number of false positives and increase the reliability of the prediction. We developed a method for multiple alignment of promoter sequences, followed by mathematical methods for calculating statistically important classes of promoter sequences. Five promoter classes were created from the rice genome, containing 1740, 222, 199, 167 and 130 promoters, respectively. We used these classes to search for potential promoter sequences (PPSs) in the rice genome with fewer than 10⁻⁸ false positives per nucleotide. A total of 145,277 PPSs were identified. Of these, 18,563 are promoters of known genes, 87,233 PPSs intersect with transposable elements, and 37,390 PPSs were found in previously unannotated sequences. The number of false positives for a randomly mixed rice genome is less than 10⁻⁸ per nucleotide. The method developed for detecting PPSs was compared with some previously used approaches. The developed mathematical method can be used to search for genes, transposable elements, and transcript start sites in eukaryotic genomes.
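The argument about per-nucleotide false positive probabilities is plain arithmetic: the expected number of false hits is the probability multiplied by the number of positions scanned. The sketch below works this out for a hypothetical genome of 4 × 10⁸ nucleotides, a round figure chosen for illustration and not a value taken from the abstract. The function name is also hypothetical.

```python
def expected_false_positives(rate_per_nt, genome_length_nt):
    """Expected false hits when every nucleotide position is tested once."""
    return rate_per_nt * genome_length_nt


# Assumption: an illustrative genome length, not a figure from the source.
GENOME_LENGTH_NT = 4e8

if __name__ == "__main__":
    for rate in (1e-3, 1e-4, 1e-6, 1e-8):
        hits = expected_false_positives(rate, GENOME_LENGTH_NT)
        print(f"rate {rate:.0e} per nt -> about {hits:,.0f} expected false positives")
```

Under this assumption, rates of 10⁻³ to 10⁻⁴ give tens to hundreds of thousands of expected false hits, while 10⁻⁸ gives only a handful. That difference is the motivation for the stricter threshold.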
A new mathematical method for detecting potential reading frameshifts in protein-coding sequences (cds) was developed. The algorithm is adjusted to the triplet periodicity of each analysed sequence using dynamic programming and a genetic algorithm, and it does not require any preliminary training. Using the developed method, cds from the Arabidopsis thaliana genome were analysed. In total, the algorithm found 9,930 sequences containing one or more potential reading frameshifts, which is ∼21% of all analysed sequences of the genome. The Type I and Type II error rates were estimated as 11% and 30%, respectively. Similar results were obtained for the genomes of Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Rattus norvegicus and Xenopus tropicalis. The developed algorithm was also tested on 17 bacterial genomes. We compared our results with previously obtained data on the search for potential reading frameshifts in these genomes. The study discusses the possibility that the reading frameshift is a relatively frequent mutation that could participate in the creation of new genes and proteins.
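The idea of using triplet periodicity to spot a change of reading frame can be illustrated with a much simpler scheme than the one described. The sketch below does not use dynamic programming or a genetic algorithm. Instead it builds a 3 × 4 table of base counts per codon position and asks which of the three phases best explains a later window, using a log-likelihood with pseudocounts. All function names are hypothetical.

```python
import math

BASES = "ACGT"


def triplet_matrix(seq):
    """Count each base at codon positions 0, 1 and 2 (a 3 x 4 table)."""
    counts = [[0] * 4 for _ in range(3)]
    for i, base in enumerate(seq):
        if base in BASES:
            counts[i % 3][BASES.index(base)] += 1
    return counts


def phase_log_probs(counts, pseudo=1.0):
    """Turn counts into log-probabilities, smoothed with a pseudocount."""
    log_probs = []
    for row in counts:
        total = sum(row) + 4 * pseudo
        log_probs.append([math.log((c + pseudo) / total) for c in row])
    return log_probs


def best_shift(window, log_probs):
    """Return the phase (0, 1 or 2) of the window's first base that
    maximises the log-likelihood under the reference periodicity."""
    scores = []
    for shift in range(3):
        score = sum(
            log_probs[(i + shift) % 3][BASES.index(base)]
            for i, base in enumerate(window)
            if base in BASES
        )
        scores.append(score)
    return max(range(3), key=lambda s: scores[s])


if __name__ == "__main__":
    reference = "ACG" * 50                      # toy sequence with perfect periodicity
    log_probs = phase_log_probs(triplet_matrix(reference))
    print(best_shift("ACG" * 10, log_probs))    # window in the reference frame
    print(best_shift("CGA" * 10, log_probs))    # window starting one base later
```

In this toy setting, a window that starts at a multiple of three in the original sequence but whose best phase is not 0 would hint at an insertion or deletion upstream. Real coding sequences have far weaker periodicity, so a fixed-window comparison like this one would be much noisier than the toy example suggests.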