We show that statistical potentials and threading score functions, derived from finite data sets, are informatic functions, and that their performance depends on the manner in which data are classified and compressed. The choice of sequence and structural parameters affects estimates of the conditional probabilities P(C|S), the quantification of the effect of sequence S on conformation C, and determines the amount of information extracted from the data set, as measured by information gain. The mathematical link between information gain and mean conformational energy, established in this work using the local backbone potential as a model, demonstrates that manipulation of descriptive parameters also alters the "energy" values assigned to the native conformation and to decoy structures in the test pool, and consequently, the performance of such statistical potential functions in fold recognition exercises. We show that sequence and structural partitions that maximize information gain also minimize the mean energy of the ensemble of native conformations. Moreover, we establish an informatic basis for the placement of the native score within an energy spectrum given by the decoy pool in a threading exercise. We discover that, among all informatic quantities, information gain is the best predictor of threading success, even better than the standard Z-score. Consequently, the choices of sequence and structural descriptors, extent of compression, and levels of discretization that maximize information gain must also produce the best potential functions. Strategies to optimize these parameters with respect to information extraction are therefore relevant to building better statistical potentials. Last, we demonstrate that the backbone torsion potential, defined by the trimer sequence, can be an effective tool in greatly reducing the set of possible conformations from a vast decoy pool.
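The relation between information gain and mean native energy can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes the common log-odds form E(C|S) = -ln[P(C|S)/P(C)] for the potential, and the sequence classes, conformational states, and counts are hypothetical. Under that assumption, the mean energy over the observations used to build the potential is the negative of the information gain, so a partition that raises the gain lowers the mean native energy.

```python
import math

# Hypothetical counts n[s][c]: sequence class s observed in conformational state c.
counts = {
    "A": {"helix": 60, "strand": 20, "coil": 20},
    "G": {"helix": 10, "strand": 20, "coil": 70},
}

def information_gain_and_mean_energy(counts):
    """Return (information gain in nats, mean log-odds energy over the data)."""
    total = sum(sum(row.values()) for row in counts.values())
    states = sorted({c for row in counts.values() for c in row})
    # Background distribution P(C), ignoring sequence.
    p_c = {c: sum(row.get(c, 0) for row in counts.values()) / total for c in states}
    gain = 0.0
    mean_energy = 0.0
    for row in counts.values():
        n_s = sum(row.values())
        for c, n in row.items():
            if n == 0:
                continue  # empty cells contribute nothing to either sum
            log_ratio = math.log((n / n_s) / p_c[c])  # ln[P(C|S) / P(C)]
            gain += (n / total) * log_ratio
            mean_energy += (n / total) * (-log_ratio)  # assumed energy form
    return gain, mean_energy

gain, mean_energy = information_gain_and_mean_energy(counts)

# Merging both sequence classes into one discards the sequence signal entirely.
merged = {"AG": {"helix": 70, "strand": 40, "coil": 90}}
gain_merged, mean_energy_merged = information_gain_and_mean_energy(merged)
```

With these counts the two-class partition gives a positive gain and a negative mean energy of equal magnitude, while the merged partition gives zero for both.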
We examine the information-theoretic characteristics of statistical potentials that describe pairwise long-range contacts between amino acid residues in proteins. In our work, we seek to map out an efficient information-based strategy to detect and optimally utilize the structural information latent in empirical data, to make contact potentials, and other statistically derived folding potentials, more effective tools in protein structure prediction. Foremost, we establish fundamental connections between basic information-theoretic quantities (including the ubiquitous Z-score) and contact "energies" or scores used routinely in protein structure prediction, and demonstrate that the informatic quantity that mediates fold discrimination is the total divergence. We find that pairwise contacts between residues bear a moderate amount of fold information, and if optimized, can assist in the discrimination of native conformations from large ensembles of native-like decoys. Using an extensive battery of threading tests, we demonstrate that parameters that affect the information content of contact potentials (e.g., choice of atoms to define residue location and the cut-off distance between pairs) have a significant influence on their performance in fold recognition. We conclude that potentials that have been optimized for mutual information and that have a high number of score events per sequence-structure alignment are superior in identifying the correct fold. We derive the quantity "information product" that embodies these two critical factors. We demonstrate that the information product, which does not require explicit threading to compute, is as effective as the Z-score, which requires expensive decoy threading to evaluate. This new objective function may be able to speed up the multidimensional parameter search for better statistical potentials.
Lastly, by demonstrating the functional equivalence of quasi-chemically approximated "energies" to fundamental informatic quantities, we make statistical potentials less dependent on theoretically tenuous biophysical formalisms and more amenable to direct bioinformatic optimization.
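For reference, the Z-score used as the threading benchmark above can be computed as follows. This is a minimal sketch with hypothetical scores, assuming the usual convention that lower scores are better and that the spread is the population standard deviation of the decoy scores; the abstract does not specify either choice. The information product itself is not reproduced here because its definition is not given in the abstract.

```python
import statistics

def z_score(native_score, decoy_scores):
    """Distance of the native score from the decoy mean, in decoy standard deviations."""
    mean = statistics.fmean(decoy_scores)
    spread = statistics.pstdev(decoy_scores)  # population standard deviation (assumed)
    if spread == 0:
        raise ValueError("decoy scores have zero spread; Z-score is undefined")
    return (native_score - mean) / spread

def native_rank(native_score, decoy_scores):
    """1-based rank of the native among the decoys, lower score ranking first."""
    return 1 + sum(1 for score in decoy_scores if score < native_score)

# Hypothetical threading scores for one target.
decoys = [-1.0, 0.0, 1.0, 2.0, 3.0]
z = z_score(-4.0, decoys)
rank = native_rank(-4.0, decoys)
```

A strongly negative Z-score together with rank 1 indicates that the native structure is well separated from the decoy pool. Evaluating it requires scoring every decoy, which is the cost the information product is said to avoid.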
In an effort to quantify loss of information in the processing of protein bioinformatic data, we examine how representations of amino acid sequence and backbone conformation affect the quantity of accessible structural information from local sequence. We propose a method to extract the maximum amount of peptide backbone structural information available in local sequence fragments, given a finite structural data set. Using methods of information theory, we develop an unbiased measure of local structural information that gauges changes in structural distributions when different representations of secondary structure and local sequence are used. We find that the manner in which backbone structure is represented affects the amount and quality of structural information that may be extracted from local sequence. Representations based on virtual bonds capture more structural information from local sequence than a three-state assignment scheme (helix/strand/loop). Furthermore, we find that amino acids show significant kinship with respect to the backbone structural information they carry, so that a collapse of the amino acid alphabet can be accomplished without severely affecting the amount of extractable information. This strategy is critical in optimizing the utility of a limited database of experimentally solved protein structures. Finally, we discuss the similarities within and differences between groups of amino acids in their roles in the local folding code and recognize specific amino acids critical in the formation of local structure.
We use basic ideas from information theory to extract the maximum amount of structural information available in protein sequence data. From a non-redundant set of protein X-ray structures, we construct local-sequence-dependent [phi,psi] distributions that summarize the influence of local sequence on backbone conformation. These distributions, approximations of actual backbone propensities in the folded protein, have the following properties: (1) They compensate for the problem of scarce data by an optimized combination of local-sequence-dependent and single-residue-specific distributions; (2) They use multi-residue information; (3) They exploit similarities in the local coding properties of amino acids by collapsing the amino acid alphabet to streamline local sequence description; (4) They are designed to contain the maximum amount of local structural information the data set allows. Our methodology is able to extract around 30 cnats of information from the protein data set out of a total of 387 cnats of initial uncertainty or entropy in a finely discretized [phi,psi] dihedral angle space (18 x 18 structural states), or about 7.8%. This was achieved at the hexamer length scale; shorter as well as longer fragments produce reduced information gains. The automatic clustering of amino acids into groups, a component of the optimization procedure, reveals patterns consistent with their local coding properties. While the overall information gain from local sequence is small, there are some local sequences that have significantly narrower structural distributions than others. Distribution width varies from at least 20% less than the average overall entropy to at least 14% above. This spread is an expression of the influence of local sequence on the conformational propensities of the backbone chain.
The optimal ensemble of local-sequence-specific backbone distributions produced is useful as a guide to structural predictions from sequence, as well as a tool for further explorations of the nature of the local protein code.
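The 18 x 18 discretization and the entropy bookkeeping mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes equal 20-degree bins over [-180, 180) and reads "cnats" as hundredths of a nat, neither of which the abstract spells out. The function names are hypothetical.

```python
import math
from collections import Counter

N_BINS = 18
BIN_WIDTH = 360.0 / N_BINS  # 20 degrees per bin (assumed equal-width binning)

def bin_index(angle_deg):
    """Map a dihedral angle in degrees to a bin 0..17; +180 wraps to bin 0."""
    return int(((angle_deg + 180.0) % 360.0) // BIN_WIDTH)

def state(phi, psi):
    """Single index 0..323 for a (phi, psi) pair on the 18 x 18 grid."""
    return bin_index(phi) * N_BINS + bin_index(psi)

def entropy_nats(states):
    """Shannon entropy, in nats, of the empirical distribution of the states."""
    counts = Counter(states)
    n = sum(counts.values())
    return -sum((k / n) * math.log(k / n) for k in counts.values())

# Upper bound on the entropy: all 324 states equally populated.
max_entropy = entropy_nats(range(N_BINS * N_BINS))

# Fraction of the initial uncertainty recovered, using the figures quoted above.
fraction_recovered = 30 / 387
```

On this reading, the quoted 387 cnats (3.87 nats) lies below the uniform-grid maximum of ln 324, about 5.78 nats, as expected for a non-uniform Ramachandran distribution, and 30/387 reproduces the quoted figure of about 7.8%.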
To reduce complexity, understand generalized rules of protein folding, and facilitate de novo protein design, the 20-letter amino acid alphabet is commonly reduced to a smaller alphabet by clustering amino acids based on some measure of similarity. In this work, we seek the optimal alphabet that preserves as much as possible of the structural information found in long-range (contact) interactions among amino acids in natively folded proteins. We employ the Information Maximization Device, based on information theory, to partition the amino acids into well-defined clusters. Numbering from 2 to 19 groups, these optimal clusters of amino acids, while generated automatically, embody well-known properties of amino acids such as hydrophobicity/polarity, charge, size, and aromaticity, and are demonstrated to maintain the discriminative power of long-range interactions with minimal loss of mutual information. Our measurements suggest that reduced alphabets (of fewer than 10 letters) are able to capture virtually all of the information residing in native contacts and may be sufficient for fold recognition, as demonstrated by extensive threading tests. In an expansive survey of the literature, we observe that alphabets derived from various approaches (including those derived from physicochemical intuition, local structure considerations, and sequence alignments of remote homologs) fare consistently well in preserving contact interaction information, highlighting a convergence in the various factors thought to be relevant to the folding code. Moreover, we find that alphabets commonly used in experimental protein design are nearly optimal and are largely coherent with observations that have arisen in this work.
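The effect of an alphabet reduction on contact information can be checked for any single candidate grouping with a short calculation. The sketch below is not the Information Maximization Device, whose search procedure is not described in the abstract; it only evaluates the mutual information between the two residue types of a contact before and after one hypothetical grouping, using a made-up four-letter contact table.

```python
import math
from collections import defaultdict

# Hypothetical contact counts for ordered residue pairs (a, b); not real data.
contacts = {
    ("I", "I"): 30, ("I", "L"): 28, ("I", "K"): 8,  ("I", "E"): 6,
    ("L", "I"): 28, ("L", "L"): 32, ("L", "K"): 7,  ("L", "E"): 9,
    ("K", "I"): 8,  ("K", "L"): 7,  ("K", "K"): 5,  ("K", "E"): 25,
    ("E", "I"): 6,  ("E", "L"): 9,  ("E", "K"): 25, ("E", "E"): 4,
}

def mutual_information(pair_counts):
    """Mutual information (nats) between the two residue types of a contact."""
    total = sum(pair_counts.values())
    left = defaultdict(float)
    right = defaultdict(float)
    for (a, b), n in pair_counts.items():
        left[a] += n
        right[b] += n
    mi = 0.0
    for (a, b), n in pair_counts.items():
        if n > 0:
            # p(a,b) * ln[p(a,b) / (p(a) p(b))], written in terms of counts
            mi += (n / total) * math.log(n * total / (left[a] * right[b]))
    return mi

def collapse(pair_counts, grouping):
    """Re-tabulate the contact counts after mapping each letter to its group."""
    reduced = defaultdict(int)
    for (a, b), n in pair_counts.items():
        reduced[(grouping[a], grouping[b])] += n
    return dict(reduced)

# One candidate two-letter alphabet: hydrophobic (h) versus polar/charged (p).
grouping = {"I": "h", "L": "h", "K": "p", "E": "p"}
mi_full = mutual_information(contacts)
mi_reduced = mutual_information(collapse(contacts, grouping))
```

Merging letters can never increase the mutual information, so `mi_reduced` is at most `mi_full`; an optimal reduced alphabet is one for which the loss is smallest among groupings of the same size.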