Structures of peptide fragments drawn from a protein can potentially occupy a vast conformational continuum. We coordinatize this conformational space with the help of geometric invariants and demonstrate that the peptide conformations of the currently available protein structures are heavily biased in favor of a finite number of conformational types, or structural building blocks. This is achieved by representing a peptide's backbone structure with geometric invariants and then clustering peptides based on the closeness of those invariants. This results in 12,903 clusters, of which 2,207 are made up of peptides drawn from functionally and/or structurally related proteins. These are termed "functional" clusters and provide clues about potential functional sites. The rest of the clusters, including the largest few, are made up of peptides drawn from unrelated proteins and are termed "structural" clusters. The largest clusters are of regular secondary structures such as helices and beta strands, as well as of beta hairpins. Several categories of helices and strands are discovered based on geometric differences. In addition to the known classes of loops, we discover several new classes, which will be useful in protein structure modeling. Our algorithm does not require assignment of secondary structure and therefore overcomes the limitations in loop classification that arise from ambiguity in secondary structure assignment at loop boundaries.
Keywords: geometric invariants; protein structure comparison; secondary structure; loop
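The two steps summarized above (describe each peptide by geometric invariants, then group peptides whose invariants are close) can be illustrated with a minimal sketch. The specific invariants and clustering procedure are not given in this section, so everything below is an assumption made for illustration: the invariant is taken to be the set of pairwise distances between backbone atoms (which is unchanged by rotation and translation but does not distinguish mirror images), the clustering is a simple greedy pass with a hypothetical distance cutoff, and the names `distance_invariants`, `greedy_cluster`, and `cutoff` are invented here.

```python
import numpy as np

def distance_invariants(coords):
    """Return the upper-triangle pairwise distances of an (n, 3) coordinate array.

    Assumed stand-in for the geometric invariants: the values do not change
    when the fragment is rotated or translated.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(coords), k=1)
    return dist[upper]

def greedy_cluster(vectors, cutoff):
    """Assign each vector to the first cluster whose representative is within cutoff.

    A vector that matches no existing representative starts a new cluster.
    """
    representatives = []
    labels = []
    for v in vectors:
        for idx, rep in enumerate(representatives):
            if np.linalg.norm(v - rep) <= cutoff:
                labels.append(idx)
                break
        else:
            representatives.append(v)
            labels.append(len(representatives) - 1)
    return labels

# Toy fragments (hypothetical coordinates, 3.8 spacing between consecutive atoms).
extended = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]])
quarter_turn = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rotated = extended @ quarter_turn.T + np.array([1.0, 2.0, 3.0])  # same shape, new pose
bent = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [3.8, 3.8, 0.0], [0.0, 3.8, 0.0]])

vectors = [distance_invariants(f) for f in (extended, rotated, bent)]
labels = greedy_cluster(vectors, cutoff=0.5)
print(labels)
```

In this sketch the extended fragment and its rotated, translated copy yield the same invariant vector and fall into one cluster, while the bent fragment starts a cluster of its own. Because the grouping depends only on geometry, no secondary structure assignment is needed at any point.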
Introduction
Globular proteins are made up of regular secondary structures, such as α-helices and β-strands, and non-regular secondary structures, which are referred to as loops. The regular secondary structures are defined by a regular pattern of backbone torsion angles for consecutive amino acid residues or by periodic patterns of hydrogen bonding between the backbone NH and C=O groups.1 Loops are the regions that join the regular secondary structures and lack this regularity of torsion angles for consecutive residues. In fact, loops, polypeptide fragments that trace a "loop-shaped" path in three-dimensional space, were considered to be random coils until recently.2 Loops are highly compact substructures and are typically situated at the protein surface. Certain loop geometries have been shown to recur across non-homologous proteins, which forms the basis for classifying loops into structural families. For example, the β-hairpin and α-turn families of loops have been described in detail.1,3-7 A much wider classification of loops of various categories has been provided by Kwasigroch et al.8 and by Oliva et al.9 Loop classification has important implications in protein structure modeling and in fitting NMR distance constraints or an X-ray crystallography electron density map to a loop geometry. In contrast, no such fine-grained classification has been attempted for α-helices and β-strands, although detectable variations are reported in these regular secondar...