A variety of nuclear localization signals (NLSs) are experimentally known although only one motif was available for database searches through PROSITE. We initially collected a set of 91 experimentally verified NLSs from the literature. Through iterated 'in silico mutagenesis' we then extended the set to 214 potential NLSs. This final set matched 43% of all known nuclear proteins and no known non-nuclear protein. We estimated that >17% of all eukaryotic proteins may be imported into the nucleus. Finally, we found an overlap between the NLS and DNA-binding region for 90% of the proteins for which both the NLS and DNA-binding regions were known. Thus, evolution seemed to have used part of the existing DNA-binding mechanism when compartmentalizing DNA-binding proteins into the nucleus. However, only 56 of our 214 NLS motifs overlapped with DNA-binding regions. These 56 NLSs enabled a de novo prediction of partial DNA-binding regions for ∼800 proteins in human, fly, worm and yeast.
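The acceptance criterion described above (a motif is kept only if it matches known nuclear proteins and no known non-nuclear protein) can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the abstract does not say how candidate variants were generated or scored, so the regular-expression encoding of motifs, the function names `motif_hits`, `accept_variants` and `coverage`, and the toy sequences are all assumptions made for this example.

```python
import re

def motif_hits(pattern, sequences):
    """Count how many sequences contain at least one match of the motif."""
    rx = re.compile(pattern)
    return sum(1 for seq in sequences if rx.search(seq))

def accept_variants(candidates, nuclear, non_nuclear):
    """Keep candidate motifs that hit some nuclear protein and no non-nuclear one.

    Assumed filter: the zero-false-positive rule mirrors the abstract's
    statement that the final set matched no known non-nuclear protein.
    """
    accepted = []
    for pattern in candidates:
        if motif_hits(pattern, nuclear) > 0 and motif_hits(pattern, non_nuclear) == 0:
            accepted.append(pattern)
    return accepted

def coverage(patterns, sequences):
    """Fraction of sequences matched by at least one motif in the set."""
    if not sequences:
        return 0.0
    compiled = [re.compile(p) for p in patterns]
    matched = sum(1 for seq in sequences if any(rx.search(seq) for rx in compiled))
    return matched / len(sequences)

# Toy data (hypothetical sequences, not real database entries).
nuclear = ["MAPKKKRKVEDP", "MSRKRPRPAAG", "MGGGAAALLL"]
non_nuclear = ["MKKLLVAGGA", "MAAKRAAKLL"]

# Hypothetical candidate motifs, written as regular expressions.
candidates = ["PKKKRKV", "KR.R", "KK"]

accepted = accept_variants(candidates, nuclear, non_nuclear)
print(accepted)                         # motifs that passed the filter
print(coverage(accepted, nuclear))      # share of nuclear proteins matched
print(coverage(accepted, non_nuclear))  # share of non-nuclear proteins matched
```

In this toy run the overly general candidate `KK` is rejected because it also occurs in a non-nuclear sequence, which is the same trade-off between coverage and specificity that the reported 43% figure reflects.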
Crystallization has proven to be the most significant bottleneck to high-throughput protein structure determination using diffraction methods. We have used the large-scale, systematically generated experimental results of the Northeast Structural Genomics Consortium to characterize the biophysical properties that control protein crystallization. Datamining of crystallization results combined with explicit folding studies leads to the conclusion that crystallization propensity is controlled primarily by the prevalence of well-ordered surface epitopes capable of mediating interprotein interactions and is not strongly influenced by overall thermodynamic stability. These analyses identify specific sequence features correlating with crystallization propensity that can be used to estimate the crystallization probability of a given construct. Analyses of entire predicted proteomes demonstrate substantial differences in the bulk amino acid sequence properties of human versus eubacterial proteins that reflect likely differences in their biophysical properties including crystallization propensity. Finally, our thermodynamic measurements enable critical evaluation of previous claims regarding correlations between protein stability and bulk sequence properties, which generally are not supported by our dataset.
Nat Biotechnol. Author manuscript; available in PMC 2010 January 1. Published in final edited form as: Nat Biotechnol. 2009 January; 27(1): 51-57. doi:10.1038/nbt.1514.
The ability to determine the atomic structures of macromolecules represents a great achievement in molecular biology because of the unparalleled value of this information in understanding the fundamental chemistry of life [1][2][3][4][5].
While nuclear magnetic resonance represents an invaluable source of structural information, especially for small proteins, most macromolecular structures are determined using x-ray crystallography. Capitalizing on the recent proliferation of genomic sequence data, "structural genomics" consortia have been organized worldwide to develop methods and infrastructure for high-throughput protein structure determination. These groups have contributed to improvements in expression and structure determination methods [6], and the four largest U.S. consortia accounted for 45% of all novel structures deposited in the Protein Data Bank (PDB) in 2007 [7]. While these efforts contribute to the impressive progress of the structural biology community in characterizing the full repertoire of protein structures, the rate of growth in sequence information nonetheless far outpaces that of structural information. Given the ongoing acceleration of whole-genome sequencing, the gap between the two will continue to expand without a breakthrough in macromolecular structure determination methods. The systematic efforts of structural genomics projects show that crystallization is the major bottleneck to protein structure determinati...
NLSdb is a database of nuclear localization signals (NLSs) and of nuclear proteins. NLSs are short stretches of residues mediating transport of nuclear proteins into the nucleus. The database contains 114 experimentally determined NLSs that were obtained through an extensive literature search. Using 'in silico mutagenesis' this set was extended to 308 experimental and potential NLSs. This final set matched over 43% of all known nuclear proteins and matches no currently known non-nuclear protein. NLSdb contains over 6000 predicted nuclear proteins and their targeting signals from the PDB and SWISS-PROT/TrEMBL databases. The database also contains over 12 500 predicted nuclear proteins from six entirely sequenced eukaryotic proteomes (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana and Saccharomyces cerevisiae). NLS motifs often co-localize with DNA-binding regions. This observation was used to also annotate over 1500 DNA-binding proteins. NLSdb can be accessed via the web site: http://cubic.bioc.columbia.edu/db/NLSdb/.
Summary: One major objective of structural genomics efforts, including the NIH-funded Protein Structure Initiative (PSI), has been to increase the structural coverage of protein sequence space. Here, we present the target selection strategy used during the second phase of PSI (PSI-2). This strategy, jointly devised by the bioinformatics groups associated with the PSI-2 large-scale production centres, targets representatives from large, structurally uncharacterised protein domain families, and from structurally uncharacterised subfamilies in very large and diverse families with incomplete structural coverage. These very large families are extremely diverse both structurally and functionally, and are highly over-represented in known proteomes. On the basis of several metrics, we then discuss to what extent PSI-2, during its first three years, has increased the structural coverage of genomes, and contributed structural and functional novelty. Together, the results presented here suggest that PSI-2 is successfully meeting its objectives and provides useful insights into structural and functional space.