EGF domains are extracellular protein modules cross-linked by three intradomain disulfides. Past studies suggest the existence of two types of EGF domain with three disulfides, human EGF-like (hEGF) domains and complement C1r-like (cEGF) domains, but to date no functional information has been related to the two types, and they are not differentiated in sequence or structure databases. We have developed new sequence patterns based on the different C-termini to search specifically for the two types of EGF domain in sequence databases. The sensitivity and specificity of the new pattern-based method represent a significant advance over currently available sequence detection techniques. We re-annotated EGF sequences in the latest release of Swiss-Prot, looking for functional relationships that might correlate with EGF type. We show that important post-translational modifications of three-disulfide EGFs, including unusual forms of glycosylation and post-translational proteolytic processing, are dependent on EGF subtype. For example, EGF domains that are shed from the cell surface and mediate intercellular signaling are all hEGFs, as are all human EGF receptor family ligands. Additional experimental data suggest that functional specialization has accompanied subtype divergence. Based on our structural analysis of EGF domains with three disulfide bonds and comparison to laminin and integrin-like EGF domains with an additional interdomain disulfide, we propose that these hEGF and cEGF domains may have arisen from a four-disulfide ancestor by selective loss of different cysteine residues.
Structural data mining studies attempt to deduce general principles of protein structure from solved structures deposited in the Protein Data Bank (PDB). The entire database is unsuitable for such studies because it is not representative of the ensemble of protein folds. Given that novel folds continue to be unearthed, some folds are currently unrepresented in the PDB while other folds are overrepresented. Overrepresentation can easily be avoided by filtering the dataset. PDB_SELECT is a widely used representative subset of the PDB that has been derived by sequence comparison. Specifically, structures whose sequences exhibit a pairwise sequence identity above a threshold value are removed from the dataset. Although length criteria for pairwise alignments have a structural basis, this automated method of pruning is essentially sequence-based and runs into problems in the twilight zone, possibly resulting in some folds being overrepresented. The value-added structure databases SCOP (Structural Classification Of Proteins) and CATH (Class-Architecture-Topology-Homology) are also a potential source of a nonredundant dataset. Here we compare the sequence-derived dataset PDB_SELECT with the structural databases SCOP and CATH. We show that some folds remain overrepresented in the PDB_SELECT dataset while other folds are not represented at all. However, SCOP and CATH also have their own problems, such as the labor-intensiveness of the update process and the difficulty of determining whether all folds are equally or sufficiently distant. We discuss areas where further work is required.
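The sequence-based pruning described above can be illustrated with a minimal sketch. It is not the PDB_SELECT implementation: the function names, the 25% default threshold, and the toy identity measure (position-by-position matching over sequences assumed to be already aligned and ungapped) are all assumptions made for illustration, whereas a real pipeline would compute proper pairwise alignments and apply length-dependent criteria.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Toy identity measure (an assumption, not a real alignment):
    percentage of matching positions over the shorter sequence,
    comparing residues position by position from index 0."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def select_representatives(seqs: Dict[str, str],
                           threshold: float = 25.0) -> List[str]:
    """Greedy pruning: keep a sequence only if its identity to every
    sequence already kept is at or below the threshold (hypothetical
    default of 25%). The result depends on the input order."""
    kept: List[str] = []
    for name, seq in seqs.items():
        if all(percent_identity(seq, seqs[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    toy = {
        "a": "ACDEFGHIKL",
        "b": "ACDEFGHIKM",  # 90% identical to "a", so it is pruned
        "c": "WWWWWWWWWW",  # shares no residues with "a", so it is kept
    }
    print(select_representatives(toy))  # ['a', 'c']
```

Because the filter sees only sequence identity, two structures with the same fold but low sequence similarity would both survive it, which is the twilight-zone limitation noted above.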