An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern could have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are "optimal" for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features. These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.

... relationships (e.g., refs. 5, 7, 8, 12). One limitation to the applicability of these results has been their inability to allow for properties or mismatches that vary in degree.
For example, in describing the charge or hydrophobicity of amino acid residues, it would be more informative to use different score levels, and when comparing sequences one may wish to count a mismatch between isoleucine and valine differently than a mismatch between glycine and tryptophan.

In this paper we describe a rigorous statistical theory that provides explicit formulas for characterizing significant sequence configurations with reference to a general scoring scheme. In particular, we determine the distribution of high aggregate segment scores and the distribution of the number of separate segments of significantly high score. A second class of results deals with the letter composition of high-scoring segments, which in certain contexts provides a method for choosing suitable scoring schemes. We will discuss the theory in two primary contexts: (i) the analysis of a single protein sequence with the objective of identifying segments with statistically significant high scores for hydropathy strength, charge concentration, size profile, phosphorylation potential, or secondary structure propensity; (ii)
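As a rough illustration of what "a region with high aggregate score" means for a single sequence, the sketch below finds the contiguous segment whose residue scores sum to the largest value, using a standard running-sum scan. The function name `max_scoring_segment` and the scoring table `ACIDIC_SCORES` (acidic residues +2, basic residues -2, everything else -1) are assumptions made for this example, not values taken from the paper. The sketch only locates the segment; it does not compute the significance formulas that the theory provides.

```python
# Illustrative scoring scheme (an assumption, not from the source):
# acidic residues score +2, basic residues -2, all other residues -1.
ACIDIC_SCORES = {"D": 2, "E": 2, "K": -2, "R": -2}
DEFAULT_SCORE = -1


def max_scoring_segment(seq, scores=ACIDIC_SCORES, default=DEFAULT_SCORE):
    """Return (best_score, start, end) for the highest-scoring segment.

    The segment is seq[start:end]. If no residue scores above zero,
    the result is (0, 0, 0), i.e. the empty segment.
    """
    best, best_start, best_end = 0, 0, 0
    run, run_start = 0, 0
    for i, residue in enumerate(seq):
        if run <= 0:
            # A non-positive running sum can never help a later segment,
            # so restart the candidate segment at this position.
            run, run_start = 0, i
        run += scores.get(residue, default)
        if run > best:
            best, best_start, best_end = run, run_start, i + 1
    return best, best_start, best_end


if __name__ == "__main__":
    score, start, end = max_scoring_segment("GKDEEDAKL")
    print(score, start, end, "GKDEEDAKL"[start:end])
```

With this table the expected score of a random residue is negative for typical protein compositions, which matches the setting in which long high-scoring segments are unlikely to arise by chance; a different property (hydrophobicity, size) would simply use a different table.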
We compare and contrast genome-wide compositional biases and distributions of short oligonucleotides across 15 diverse prokaryotes that have substantial genomic sequence collections. These include seven complete genomes (Escherichia coli, Haemophilus influenzae, Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp. strain PCC6803, Methanococcus jannaschii, and Pyrobaculum aerophilum). A key observation concerns the constancy of the dinucleotide relative abundance profiles over multiple 50-kb disjoint contigs within the same genome. (The profile is $\rho^*_{XY} = f^*_{XY} / (f^*_X f^*_Y)$ for all XY, where $f^*_X$ denotes the frequency of the nucleotide X and $f^*_{XY}$ denotes the frequency of the dinucleotide XY, both computed from the sequence concatenated with its inverted complementary sequence.) On the basis of this constancy, we refer to the collection $\{\rho^*_{XY}\}$ as the genome signature. We establish that the differences between $\{\rho^*_{XY}\}$ vectors of 50-kb sample contigs of different genomes virtually always exceed the differences between those of the same genomes. Various di- and tetranucleotide biases are identified. In particular, we find that the dinucleotide CpG = CG is underrepresented in many thermophiles (e.g., M. jannaschii, Sulfolobus sp., and M. thermoautotrophicum) but overrepresented in halobacteria. TA is broadly underrepresented in prokaryotes and eukaryotes, but normal counts appear in Sulfolobus and P. aerophilum sequences. More than for any other bacterial genome, palindromic tetranucleotides are underrepresented in H. influenzae. The M. jannaschii sequence is unprecedented in its extreme underrepresentation of CTAG tetranucleotides and in the anomalous distribution of CTAG sites around the genome. Comparative analysis of numbers of long tetranucleotide microsatellites distinguishes H. influenzae. Dinucleotide relative abundance differences between bacterial sequences are compared.
For example, in these assessments of differences, the cyanobacteria Synechocystis, Synechococcus, and Anabaena do not form a coherent group and are as far from each other as general gram-negative sequences are from general gram-positive sequences. The difference of M. jannaschii from low-G+C gram-positive proteobacteria is one-half of the difference from gram-negative proteobacteria. Interpretations and hypotheses center on the role of the genome signature in highlighting similarities and dissimilarities across different classes of prokaryotic species, possible mechanisms underlying the genome signature, the form and level of genome compositional flux, the use of the genome signature as a chronometer of molecular phylogeny, and implications with respect to the three putative eubacterial, archaeal, and eukaryote domains of life and to the origin and early evolution of eukaryotes.

In this report, we describe measures of genomic similarities that do not depend on prior alignment of homologous sequences and apply them to sufficiently large samples of prokaryotic genomic sequences. The approach departs from almost all other metho...
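The dinucleotide relative abundance profile defined in the abstract (the frequency of each dinucleotide XY divided by the product of the frequencies of X and Y, over the sequence together with its inverted complement) can be sketched in a few lines. The names `genome_signature` and `signature_difference` are hypothetical. Two choices here are assumptions rather than details from the text: dinucleotides are counted within each strand and pooled, so that no artificial dinucleotide is created at the junction of the concatenation, and the difference between two signatures is taken as the mean absolute difference over the 16 dinucleotides, since the excerpt does not say how differences are measured.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the inverted complementary strand of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def genome_signature(seq):
    """Map each dinucleotide XY to f_XY / (f_X * f_Y) over both strands."""
    seq = seq.upper()
    mono, di = Counter(), Counter()
    for strand in (seq, reverse_complement(seq)):
        mono.update(strand)
        # Count overlapping dinucleotides within this strand only
        # (assumption: the junction between strands is not counted).
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    # Normalise over unambiguous bases only, ignoring symbols such as N.
    n_mono = sum(mono[b] for b in BASES)
    n_di = sum(di[x + y] for x, y in product(BASES, repeat=2))
    signature = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        signature[x + y] = observed / expected if expected else float("nan")
    return signature


def signature_difference(sig_a, sig_b):
    """Mean absolute difference over the 16 dinucleotides (an assumption)."""
    return sum(abs(sig_a[k] - sig_b[k]) for k in sig_a) / len(sig_a)


if __name__ == "__main__":
    sig = genome_signature("AACG")
    print(round(sig["CG"], 3), round(sig["AC"], 3), sig["TA"])
```

Because both strands are pooled, each dinucleotide and its reverse complement (for example AC and GT) receive the same value, so the 16 entries carry only 10 distinct values. Real use would apply the function to contigs of roughly 50 kb, as in the abstract, rather than to a toy string.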