More than 30 organisms have been sequenced entirely. Here, we applied a variety of simple bioinformatics tools to analyze 29 proteomes for representatives from all three kingdoms: eukaryotes, prokaryotes, and archaebacteria. We confirmed that eukaryotes have relatively more long proteins than prokaryotes and archaea, and that the overall amino acid composition is similar among the three. We predicted that ∼15%-30% of all proteins contained transmembrane helices. We could not find a correlation between the content of membrane proteins and the complexity of the organism. In particular, we did not find significantly higher percentages of helical membrane proteins in eukaryotes than in prokaryotes or archaea. However, we found more proteins with seven transmembrane helices in eukaryotes and more with six and 12 transmembrane helices in prokaryotes. We found twice as many coiled-coil proteins in eukaryotes (10%) as in prokaryotes and archaea (4%-5%), and we predicted ∼15%-25% of all proteins to be secreted by most eukaryotes and prokaryotes. Every tenth protein had no known homolog in current databases, and 30%-40% of the proteins fell into structural families with >100 members. A classification by cellular function verified that eukaryotes have a higher proportion of proteins for communication with the environment. Finally, we found at least one homolog of experimentally known structure for ∼20%-45% of all proteins; the regions with structural homology covered 20%-30% of all residues. These numbers may or may not suggest that there are 1200-2600 folds in the universe of protein structures. All predictions are available at http://cubic.bioc.columbia.edu/genomes.
Keywords: Protein sequence analysis; analyzing entire genomes; helical membrane proteins; coiled-coil proteins; signal peptides; comparative modeling
Supplemental material: See www.proteinscience.org.
Comparative genomics begins with collecting and describing. Sequencing the entire genome of the first free-living organism, Haemophilus influenzae, opened a new era of data flooding in molecular biology. Since then, over 40 genomes have been sequenced, mostly for pathogens and model organisms. These include the first eukaryotic genome, Saccharomyces cerevisiae (1997), and the first animal genomes, Caenorhabditis elegans (1998), Drosophila melanogaster (Adams et al. 2000), and Homo sapiens (The Genome International Sequencing Consortium 2001; Venter et al. 2001). What can we learn from all the data? Like zoology and botany a century ago, we are just commencing to catalog the com
Reprint requests to: Burkhard Rost, CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168 Street, BB217, New York, New York 10032, USA; e-mail: rost@columbia.edu; fax: (212) 305-7932.
Abbreviations: 3D structure, three-dimensional structure (i.e., coordinates of all residues/atoms in a protein); COILS, prediction of coiled-coil regions from sequence based on statistics and expert rules; ORF, open reading frame (protein predicted by genome-sequen...