We present a comparative proteome analysis of the five complete eukaryotic genomes (human, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana), focusing on individual and multiple amino acid runs, charge and hydrophobic runs. We found that human proteins with multiple long runs are often associated with diseases; these include long glutamine runs that induce neurological disorders, various cancers, categories of leukemias (mostly involving chromosomal translocations), and an abundance of Ca2+ and K+ channel proteins. Many human proteins with multiple runs function in development and/or transcription regulation and are Drosophila homeotic homologs. A large number of these proteins are expressed in the nervous system. More than 80% of Drosophila proteins with multiple runs seem to function in transcription regulation. The most frequent amino acid runs in Drosophila sequences occur for glutamine, alanine, and serine, whereas human sequences highlight glutamate, proline, and leucine. The most frequent runs in yeast are of serine, glutamine, and acidic residues. Compared with the other eukaryotic proteomes, amino acid runs are significantly more abundant in the fly. This finding might be interpreted in terms of innate differences in DNA-replication processes, repair mechanisms, DNA-modification systems, and mutational biases. There are striking differences in amino acid runs for glutamine, asparagine, and leucine among the five proteomes.
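To make the notion of an amino acid run concrete, the following minimal Python sketch lists maximal runs of a single residue in a protein sequence. It is an illustration only, not the procedure used in this analysis: the function names `find_runs` and `format_runs`, the regular-expression approach, and the toy sequence are all assumptions introduced here, and the default threshold of five residues simply mirrors the run listings given in the introduction.

```python
import re
from typing import List, Tuple

# One uppercase letter followed by zero or more copies of itself,
# i.e. a maximal stretch of a single repeated residue.
_RUN = re.compile(r"([A-Z])\1*")


def find_runs(sequence: str, min_length: int = 5) -> List[Tuple[str, int, int]]:
    """Return (residue, start, length) for every maximal run of one
    residue that is at least min_length long (start is 0-based)."""
    runs = []
    for match in _RUN.finditer(sequence.upper()):
        length = match.end() - match.start()
        if length >= min_length:
            runs.append((match.group(1), match.start(), length))
    return runs


def format_runs(runs: List[Tuple[str, int, int]]) -> str:
    """Render runs as residue letter plus run length, e.g. 'Q7, P6'."""
    return ", ".join(f"{residue}{length}" for residue, _, length in runs)


# Hypothetical toy sequence (not a real protein): runs of Q, P, and E.
toy = "MA" + "Q" * 7 + "L" + "P" * 6 + "G" + "E" * 5 + "K"
print(format_runs(find_runs(toy)))
```

Shorter stretches fall below the threshold and are not reported; lowering `min_length` would include them. Charge runs or hydrophobic runs would need a residue-class mapping applied before the scan, which this sketch does not attempt.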
Several human inherited neurodegenerative diseases are triplet-repeat diseases associated with proteins containing long runs of glutamine (long CAG codon iterations; for reviews, see refs. 1 and 2). Disease severity seems to be correlated with the extent of iterations of the CAG codon above a threshold (3). Strikingly, many of the triplet-repeat disease proteins contain multiple long runs of amino acids other than glutamine. Listing all runs of at least five residues (using the standard one-letter amino acid code followed by the run length, so that Q23 denotes a run of 23 glutamines), the huntingtin protein contains Q23, P11, P10, E5, E6; atrophin-1 (dentatorubral pallidoluysian atrophy, DRPLA) contains Q20, S7, S10, P6, H5; the androgen receptor protein (Kennedy's disease) contains Q26, Q6, Q5, P8, A5, G24; and the brain-voltage-dependent calcium channel protein CCAA (spinocerebellar ataxia 6) contains H10 and Q11.
Consequences of hyperexpansion of DNA-triplet repeats might include altered rates of transcription or translation, mRNA instability, and aberrant DNA-hairpin structures (4, 5). Protein aggregation attributed to attachment of glutamine-rich proteins to unrelated molecules may lead to inappropriate multimerization or to formation of "polar zippers," in which a long stretch of glutamine residues links strands by hydrogen bonds (6-8).
The foregoing examples motivate our comparative analysis of eukaryotic proteomes focusing on proteins containing multiple amino acid runs. The complete genomes investigated are those of the Human Genome Project tentative dr...